arXiv:2407.14068v2  [astro-ph.IM]  3 Jun 2025
Draft version June 12, 2025
Typeset using LATEX preprint style in AASTeX631
A “Rosetta Stone” for Studies of Spatial Variation in Astrophysical Data:
Power Spectra, Semivariograms, Structure Functions, and More
Benjamin Metha^{1,2} and Sabrina Berger^{1,2,3,∗}
1University of Melbourne, 1 Tin Alley, Parkville 3050 Victoria, Australia
2Australian Research Council Centre of Excellence for All-Sky Astrophysics in 3-Dimensions, Australia
3Research School of Astronomy and Astrophysics, Australian National University, Canberra, ACT 2611, Australia
ABSTRACT
From the turbulent interstellar medium to the cosmic web, astronomers in many differ-
ent fields have needed to make sense of spatial data describing our Universe. Through
different historical choices for mathematical conventions, many different subfields of
spatial data analysis have evolved their own language for analysing structures and quan-
tifying correlation in spatial data. Because of this history, terminology from a myriad of
different fields is used, often to describe two data products that are mathematically iden-
tical. In this Note, we define and describe the differences and similarities between the
power spectrum, the two-point correlation function, the covariance function, the semi-
variogram, and the structure functions, in an effort to unify the languages used to study
spatial correlation. We also highlight under which conditions these data products are
useful and describe how the results found using one method can be translated to those
found using another, allowing for easier comparison between different subfields’ native
methods. We hope for this document to be a “Rosetta Stone” for translating between
different statistical approaches, allowing results to be shared between researchers from
different backgrounds, facilitating more cross-disciplinary approaches to data analysis.
1. INTRODUCTION
Things that are close to each other tend to be similar in other ways. This general principle, some-
times referred to as Tobler’s First Law of Geography (Tobler 1970), describes everything. Hot days
tend to follow hot days, and cold days tend to follow cold ones. People who live in the same area
tend to vote similarly, earn a similar amount of money, drive similar cars, and live to about the same
age. On the smallest scales, mosquitoes that are captured from the same local environment are more
genetically similar than those that are taken from separate locations. The concentration of malaria
in their bloodstreams will be more similar if they are drawn from nearby locations, as will the level
of drug resistance within the malaria parasites. On the largest scales, the structure of the Universe
also follows Tobler’s First Law: regions that are rich in matter tend to be close to other dense regions
of the Cosmos.
methab@student.unimelb.edu.au
sabrina.berger@student.unimelb.edu.au
∗Both SB and BM contributed equally to this work, and should be treated as corresponding authors and first-authors
for citation purposes.

It is of no surprise, then, that mathematicians from many diverse disciplines have made attempts
to capture the ways that the similarity between different things depends on their distance. In its
theoretical form, mathematics is a language that precisely and unambiguously describes the Universe.
It is both invented and discovered: we invent words to describe the things we see, and discover
relationships between them.
As with any language, mathematics has dialects. As these different areas of study evolved, all
of them independently tried to solve the problem of how to best describe the ways in which nearby
things resemble each other. Like finches isolated on separate islands of the Galapagos, over time, each
subfield organically developed its own set of methods, tools, and techniques, each of which capture
the same kind of characteristics about a spatial data set.1
Today, researchers who come from separate “islands of study” have great difficulty understanding
one another. The investigator who wishes to read widely, connect to other cultures, and discover
cross-disciplinary approaches to analyse their data must sail through treacherous waters, infested
with scary mathematical functions such as variograms and semivariograms; the two-point correlation
function, autocorrelation function, autocovariance function and cross-correlation function; the power
spectrum and power spectral density; the energy spectrum; and structure functions of the first, second,
and higher orders. This is a shame, as it is often the case that a problem in one field has already
been solved by a mathematical approach that is native to a different field – but unless we can talk
to each other, there is no way that we can learn from each other.
The purpose of this document is to help solve this problem. We are two graduate students in
astronomy who have been taught how to analyse spatial data on two different mathematical-linguistic
islands (Benjamin Metha speaks the language of geostatistics, Section 4.1; and Sabrina Berger’s
native tongue is the Fourier method of power spectrum analysis, Section 4.2). In an attempt to
understand each others’ methods, we have crossed a dense jungle of nomenclature. We leave this
Note behind as a collection of terminological trail markers, so that others can traverse the rough
seas between each island with ease. Our intended reader is both the new student and the expert
in one of these particular methods, who is curious about the similarities and differences between
their approach and other methods that have been tried and tested in the literature. We hope that
this piece serves as a “Rosetta stone,” allowing methods from one discipline to be translated into
the language of another, enabling results to be shared between previously-isolated communities, and
facilitating cross-disciplinary collaboration.
Before we delve into the myriad of languages that are used to describe spatial data, it helps to
have a common language that we can speak. In Section 2, we present some important tools from
classical statistics that capture the broad properties of random variables: the mean (what value is
a random variable on average?), variance (how similar do observed values of the random variable
tend to be?), covariance (how much do changes in this random variable tend to imply changes in
another random variable?), and correlation (a normalised version of the covariance that shows how
much information a random variable contains about another). Once we have built a solid bedrock,
we then extend these definitions to work on random fields (random variables that are observed over
a spatial domain) and time series data (random variables that are observed over time) in Section
3. While this content may be familiar to a large fraction of our audience, we have discovered that
1 Indeed, one could use Tobler’s First Law of Geography to describe this kind of mathematical-linguistic speciation:
scientific disciplines that are close to each other (in terms of both the kinds of things that they study and where and
when these investigations historically happened) tend to follow similar statistical naming conventions.

there is still a tangle of terminology and some disagreement on definitions in these shallow waters.
We encourage even the seasoned statistician to at least skim this Section, pausing to ensure that the
definitions they are familiar with match our own.
We then embark on an exploration of the way that spatial data is analysed in three different fields
of study. We begin in Section 4.1 by covering how geostatisticians use the semivariogram to characterise
correlation in spatially-varying data. In Section 4.2, we leave the real world behind and travel into
Fourier space to explore how cosmologists use the power spectrum to extract the same characteristics.
Next, in Section 4.3, we visit the chaotic realm of turbulence. We learn how two tools used by fluid
dynamicists to capture structure in stochastic environments, the energy spectrum and the structure
functions, are related to the other approaches that we have encountered. We reflect on our journey
in Sections 5 and 6, drawing bridges between these different domains, and discussing which data
products each approach is best suited for, and how results found with one approach are connected
to the other approaches. Finally, we provide a glossary that gives a plain English definition to all of
the mathematical terms that we have encountered in this Note, to assist the curious researcher who
wants to read more about the approaches to spatial data analysis used in other fields.
To accompany this pedagogical article, we have also created an interactive Jupyter notebook2 that
contains Python implementations of every method that we cover on this tour, and a selection of
random fields on which they can be applied. We hope that this is useful for the reader (i) to gain
some hands-on intuition on how these methods work numerically and what information they tell you,
and (ii) to have some off-the-shelf implementations of these methods available that you (yes, you!)
can use for your own research.
To keep this Note relatively brief, we limit our investigations to methods that capture only the first-
and second-order structure of a random field.3 For this reason, some related ways of capturing spatial
correlation4 will not appear in this manuscript: the bispectrum, the trispectrum, the three-point
correlation function, the scattering transform (first applied to cosmology by Cheng et al. 2020), and
machine- and deep-learning approaches (e.g. convolutional neural networks).
2. FUNDAMENTALS
When we make measurements in the real world, the data that we acquire are always uncertain, and
the processes that produce the data that we see often contain an element of randomness. In order to
make sense of what we see, it is important to understand how much we don’t know about what we
observe. To understand the external, uncertain world that we see around us, we must use statistics.
To capture this uncertainty without losing any information, statisticians define the things that we
observe in terms of random variables.
A random variable is a mathematical way of representing outcomes that you are not sure about.
When you try to measure the value of a random variable, there are a range of different possible
answers that you could get. Ask a person on the street their age, and you could feasibly get any
number between two and one hundred.5 If you repeat this experiment with a different person, you
will probably get a different number. We use capital letters (X, Y ) to denote these random variables,
2 Available on SB’s GitHub.
3 This information is sufficient to completely describe Gaussian random fields. For other kinds of random fields, higher-
order statistics are needed to capture deviations from Gaussianity – but that’s another story. Even if your field is not
Gaussian, you can still learn a lot about how it behaves from its first and second-order statistics.
4 Except for right here, and in the Glossary.
5 Note: we do not recommend that you run this experiment in real life, as it is rude to ask strangers their age.

and lowercase letters with subscripts to denote the different values that we measure for them. In this
scenario, let X be the random variable “the age of a person on this street”. Then x1 is the age that
the first person that we ask tells us, x2 is the age that the second person tells us, and so on. If we
were to also ask each person for their height, then we could let Y be the random variable “the height
of a person on this street”, and this random variable would have the sampled values y1, y2, and so
on for each person.
When we analyse random variables, our goal is to get some idea of their distribution and how
they depend on other variables to inform our understanding about what’s going on. The probability
distribution of a random variable tells you all of the possible values that the random variable can
take, and how likely each of them is to occur. By continually taking many samples of a random
variable (asking lots of people on the street their age, in our example), we get some idea of what
the typical values of a random variable are, and how much they tend to vary. If we were to conduct
our experiments just outside a primary school, we would get a different typical value for the ages of
people than we would if we tried this experiment in a jazz bar. On the other hand, if we compared
the ages of a sample of random people in a primary school to the ages of a sample of random people
in a third-grade maths classroom, we might get similar typical values for both random variables, but
we would find that the values that we measure tend to vary a lot more in the first case than in the
second one. We make these concepts more formal below.
2.1. Measures of centre and measures of spread: mean, variance, and standard deviation
Let X be a random variable that we have measured n times, with values of x1, x2, . . . , xn. We want
to know two things about the distribution of this random variable. Firstly, what is the value of this
random variable roughly (mean)? Secondly, how similar do observed values of the random variable
tend to be (variance)6?
To answer our first question, we define the sample mean to be:
\bar{x} = \widehat{E}[X] = \frac{1}{n} \sum_{i=1}^{n} x_i.    (1)
This value is an estimator7 for the expectation value (or expected value) of a random variable. The
expectation value of a random variable is the average value across all values that appear in its
population. By population, we mean the set of all possible observations of a random variable. The
frequency of each individual value it can take is the probability of that event (defined below in both
the discrete and continuous case). The set of each probability among possible outcomes comprises
the random variable’s probability distribution. If we have samples of a random variable, we can use
Equation 1 to estimate the expected value of a random variable. However, we could never hope to get
our hands on the exact expectation value of a random variable from a finite number of measurements
alone (if we’re sampling from an infinite population). Instead, we would need to know the probability
distribution of the random variable to calculate it exactly.
6 In the world of probability, answering these questions is equivalent to estimating the first (raw) and second (central)
moments of the random variable’s probability distribution function. You can find a definition of moments, their close
cousins the cumulants, and the way that they are related here.
7 In this Note, we use estimator to mean a statistical estimator which uses a rule, such as Equation 1 to estimate a
quantity – see the glossary for further details.

The expectation value of a discrete random variable, X, is:
E[X] \equiv \sum_{i=1}^{\infty} x_i p_i,    (2)

where p_i is the probability mass function, or the discrete probability distribution function. That is,
each value of p_i is the probability that we will observe X to have a value of x_i (in mathematical
language, this is written as p_i = P(X = x_i)), and the x_i are all of the possible outcomes of X.
Sometimes, there are too many values that a random variable can take (for example, if the random
variable in question can take any real number as a value), and this summation definition of the
expectation value does not work. For times like this, the expectation value of a random variable can
also be defined as follows:
E[X] \equiv \int_{-\infty}^{\infty} x \, p(x) \, dx,    (3)

where p(x) is the probability density function of our random variable X. The values of p(x) are
defined so that the integral of p(x) between x_1 and x_2 is the probability that when we observe X, it
lies between x_1 and x_2. In mathematical language, this is written as
P(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2} p(x) \, dx.
In Equations 2 and 3, we defined the expectation value, E[X], which we cannot get exactly right
unless we know the exact distribution the random variable X was drawn from. The most common
estimator is the sample mean (which we will denote by ¯x in this Note) which is the average computed
from a finite number of observations of X (Equation 1).
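As a quick numerical illustration of Equations 1 and 2, consider a fair six-sided die (a hypothetical example of ours, separate from the companion notebook):

```python
import numpy as np

# Expectation value of a fair six-sided die (Equation 2):
# E[X] = sum_i x_i p_i, with p_i = 1/6 for every face.
faces = np.arange(1, 7)
probs = np.full(6, 1 / 6)
expectation = np.sum(faces * probs)  # 3.5, up to floating-point rounding

# The sample mean (Equation 1) only *estimates* E[X] from finitely many rolls.
rng = np.random.default_rng(seed=0)  # fixed seed, for reproducibility
rolls = rng.integers(1, 7, size=10_000)  # 10,000 simulated rolls of 1-6
sample_mean = rolls.mean()

print(expectation)  # 3.5 (to rounding)
print(sample_mean)  # close to, but not exactly, 3.5
```

Here the exact expectation is available only because we know the full probability distribution; the sample mean fluctuates around it and converges as the number of rolls grows.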
To answer our second question (how much does this random variable tend to vary?), we measure
the spread of a random variable by calculating the variance of our sample. As we did with the mean,
we present the true definition of the variance as well as a common estimator. To compute the true
variance, we calculate the expected value of the squared difference between the values of X that we
could measure and their expected value E[X]:

Var[X] \equiv E[(X - E[X])^2].    (4)
In this definition, Var[X] depends on the distribution of our random variable, and not on just a
few samples that we observe – that is, we need to know the distribution of the random variable to
compute the variance, or at least the first and second moments. Since we hardly ever know the
full distribution of a random variable in practice, we like to estimate the variance with the unbiased
sample variance. This is computed by taking the average squared difference of each data point from
the sample mean:
\widehat{Var}(X) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,    (5)
where the wide hat denotes that this is a variance estimator.
In both of these definitions, the units of variance will be the square of the units of the original
variable. For this reason, it is often convenient to talk about the standard deviation, defined to be
the square root of the variance, to describe how much the random variable tends to randomly vary.
Because it has the same units as the original thing that was measured, they are easy to compare.
Often, the symbol σ (sigma) is used to denote the standard deviation.
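The estimators above can be computed in a few lines of Python (a sketch with made-up data; note that numpy's ddof keyword selects between dividing by n and by n − 1):

```python
import numpy as np

# A small, made-up sample of a random variable X.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

x_bar = x.mean()  # sample mean (Equation 1)

# Unbiased sample variance (Equation 5): note the division by n - 1.
var_hat = np.sum((x - x_bar) ** 2) / (n - 1)

# numpy computes the same quantity when ddof=1 is passed
# (the default, ddof=0, divides by n instead).
var_np = np.var(x, ddof=1)

# Standard deviation: the square root, in the same units as x.
std_hat = np.sqrt(var_hat)

print(x_bar, var_hat, std_hat)  # 5.0, ~4.571, ~2.138
```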

Wait, why are we dividing by n − 1 and not n?
Technically, the above Equation (5) is not exactly an arithmetic average, because we divide by
n − 1 and not n. If we only have samples from a population of a random variable, we only
know the sample mean and not the true expectation value. If we knew the expectation value
E[X], we could estimate the variance without bias as
Var(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - E[X])^2. We usually don't
know E[X], though! If we calculate the sample variance using our best estimate, the sample
mean, as \widehat{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, the result will be biased to be slightly lower than the true
value of the variance. Using n − 1 instead of n is called Bessel's correction, and it corrects for
this bias, allowing us to estimate the true variance in an unbiased way.
As n gets larger, the difference between using n − 1 and n becomes negligible, and in the limit
where we have perfect knowledge of our random variable (as n → ∞), the definitions that
use n and n − 1 become exactly the same. For more intuition as to why dividing by n − 1 is
the right thing to do, watch this 6 minute video (or 3 minutes at 2x speed), which walks you
through an explicit example showing that this is true. Alternatively, a full proof is presented
here.
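The bias that Bessel's correction removes can also be seen empirically. This sketch (our own, with a fixed seed) draws many small samples from a standard normal distribution, whose true variance is 1, and averages both estimators:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Many small samples (n = 5 each) from a standard normal, true variance 1.
n, trials = 5, 100_000
samples = rng.standard_normal((trials, n))

biased = np.var(samples, axis=1, ddof=0).mean()    # divide by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divide by n - 1

# The biased estimator undershoots by a factor of (n - 1)/n = 0.8 on
# average; Bessel's correction recovers the true variance.
print(biased)    # ≈ 0.8
print(unbiased)  # ≈ 1.0
```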
The standard deviation has been very well studied, so we know a lot about how it is expected to
behave, making it a very useful statistic. Just by knowing the mean and standard deviation of a
random variable, we get a lot of information about its distribution. To show you what we mean,
let’s return to our example where x is the age of a randomly-selected person on the street. After
interviewing a good number of people, we calculate the sample mean of this random variable to
be \bar{x} = 40 years and compute its sample variance to be \widehat{Var}(X) = 100 years², giving us a
standard deviation of 10 years. From this information and the incorrect assumption that age follows a
normal distribution (it doesn’t because age can’t be negative!), and our knowledge of how a standard
deviation works, we can predict that about 68% of the people on this street will be between 30 and
50 years of age (less than one standard deviation from the mean), about 95% will be between 20 and
60 (less than two standard deviations away), and about 99.7% will be between 10 and 70 (less than
three standard deviations away). If the standard deviation was 5 instead, all of these age intervals
would only be half as wide. If the standard deviation was 0, then this would mean that there is no
variation in our data, and everyone on the street is exactly the same age.
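The 68-95-99.7 figures quoted above can be reproduced numerically, under the same (deliberately incorrect) assumption that ages follow a normal distribution with mean 40 and standard deviation 10:

```python
import numpy as np

# Simulated "ages": normal with mean 40 and standard deviation 10.
# (The normal distribution is assumed purely for illustration.)
rng = np.random.default_rng(seed=1)
ages = rng.normal(loc=40, scale=10, size=100_000)

within_1 = np.mean(np.abs(ages - 40) < 10)  # 30-50 years
within_2 = np.mean(np.abs(ages - 40) < 20)  # 20-60 years
within_3 = np.mean(np.abs(ages - 40) < 30)  # 10-70 years

print(within_1)  # ≈ 0.683
print(within_2)  # ≈ 0.954
print(within_3)  # ≈ 0.997
```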
2.2. Defining correlation for random variables
Now, let’s consider the case where we have two random variables, X and Y – say, age and height
of random people in a population. We take a different measurement of both X and Y for each of
n people, giving us samples with values of (x1, y1), (x2, y2), . . . , (xn, yn). We want to find a statistic
that can answer the following question: how related are these two random variables? Does knowing
about one of them give you any information about the other – that is, can you use someone’s height
to predict their age, or vice versa?
When the relationship between X and Y is monotonic, the word for what we are trying to measure is
correlation. If X and Y are positively correlated, then if X is measured to have a value that is higher
than its mean, then Y will likely be higher than its mean, too. If they are negatively correlated, then
if X is measured to be higher than its mean, Y is (on average) lower than its mean. The final option
is that these two random variables are independent: measuring X does not give us any additional
information about Y , and measuring Y does not give us any additional information about X.

As a first attempt to capture this information about this pair of variables, we define the covariance
between X and Y as:
Cov(X, Y ) ≡E[(X −E[X])(Y −E[Y ])].
(6)
However, we can’t usually know the expectation values of our random variables exactly. So in practice,
we instead estimate the covariance with the unbiased sample covariance as follows:8
\widehat{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).    (7)
This statistic does a good job of capturing the information that we are interested in. If X tends to
rise when Y rises, and tends to fall when Y falls, then the values inside the summand will tend to be
positive – and \widehat{Cov}(X, Y), on the whole, will be a positive number. The more often this happens, the
higher the value of \widehat{Cov}(X, Y) – so \widehat{Cov}(X, Y) is sensitive to the level of correlation between these two
variables. On the other hand, if X tends to fall when Y rises and rise when Y falls, then \widehat{Cov}(X, Y)
will more often than not be negative, so \widehat{Cov}(X, Y) will be negative overall. On an unexpected third
hand, if there is no connection between X and Y, then the product (x_i − \bar{x})(y_i − \bar{y}) will be equally
likely to be positive or negative for each pair (x_i, y_i) – so after averaging over all pairs, the covariance
should be close to zero.
The problem with using covariance to measure the similarity between two variables is that it is
difficult to interpret. Firstly, it has weird units – \widehat{Cov}(X, Y) has the units of X multiplied by the
units of Y. If X is age in years and Y is height in feet, then \widehat{Cov}(X, Y) will have the unusual units of
foot-years. If we measured a covariance value of 0.7 foot-years for these two random variables, would
you think that they are more correlated or less correlated than you expected?
Secondly, the covariance does not only depend on the correlation between these two variables, but
also on how much they vary individually. If the variance of X is larger, then the values of (x_i − \bar{x})
will be larger, too – so the covariance will rise. In practice, this means that if we measured the heights
and ages of babies over a six month period and calculated the covariance, we would get a number
that is about four times smaller than we would if we tried the same experiment using a year's worth
of data. At first glance, an age-height covariance of 0.06 foot-years seems to imply that two variables
are less strongly connected than if a pair of random variables had a covariance of 0.2 foot-years, but
unless we know what the variances of both X and Y are, we cannot say this for sure.
All things considered, the covariance could be more useful if it were normalised by the amplitudes
of the variances in a way that gets rid of all the weird units. So that's exactly what we do.
We define the correlation (ρ) between X and Y to be the covariance between them, normalised by
their standard deviations (the square roots of their variances):

\rho(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X) \, Var(Y)}},    (8)

or the sample correlation version:

\hat{\rho}(X, Y) = \frac{\widehat{Cov}(X, Y)}{\sqrt{\widehat{Var}(X) \, \widehat{Var}(Y)}}.    (9)
8 Note the 1/(n − 1) prefactor; this is Bessel's correction, and it works to unbias the covariance in the same way
that it unbiases the variance estimated in Equation 5.

The definition of correlation in Equation 9 (often called the Pearson correlation coefficient9) has
some nice mathematical properties. For perfectly correlated data, ρ = 1. You can see this for yourself
by trying to compute ρ(X, X) (because X is perfectly correlated with itself) and simplifying:

\rho(X, X) = \frac{\widehat{Cov}(X, X)}{\sqrt{\widehat{Var}(X) \, \widehat{Var}(X)}} = \frac{\widehat{Var}(X)}{\widehat{Var}(X)} = 1.    (10)
Similarly, for perfectly anticorrelated data (like X and −X), ρ = −1. For independent X and Y ,
ρ(X, Y ) will be zero, because the covariance between X and Y will be zero. The value of ρ, then,
can be thought of as a statistic that tells you (i) whether two variables are positively or negatively
correlated, and (ii) how strongly related they are on a scale of “not at all” (ρ = 0) to “completely”
(ρ = ±1).10
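These properties of ρ are easy to verify numerically. In this sketch (our own), the 1/(n − 1) prefactors of Equation 9 cancel between numerator and denominator, so they are omitted:

```python
import numpy as np

def pearson(a, b):
    """Sample Pearson correlation coefficient (Equation 9)."""
    a_dev, b_dev = a - a.mean(), b - b.mean()
    return np.sum(a_dev * b_dev) / np.sqrt(np.sum(a_dev**2) * np.sum(b_dev**2))

rng = np.random.default_rng(seed=3)
x = rng.normal(size=1000)
y = rng.normal(size=1000)  # drawn independently of x

print(pearson(x, x))   # 1: perfectly correlated (Equation 10)
print(pearson(x, -x))  # -1: perfectly anticorrelated
print(pearson(x, y))   # close to 0: independent variables
```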
3. EXTENDING THESE DEFINITIONS FOR SPATIAL STATISTICS
Now that we have defined covariance and correlation for random variables, we are ready to discuss
how to naturally extend these definitions to learn about correlation within random fields. In doing so,
we will be introducing a lot of terminology. We warn the reader who has some familiarity with this
area to proceed with caution: many of these functions have been given different names in different
fields. We have chosen our definitions and notation as carefully as possible to serve as natural
extensions of the statistical definitions given in the previous Section, upon which everyone agrees.
We define a random field over a domain D such that at each point ⃗x in our domain, the value of
Z(⃗x) is a random variable. If we observe Z at n points within our domain, then we get a sample
of this random field, with values Z( ⃗x1), Z( ⃗x2), . . . , Z( ⃗xn). For the purposes of this document, we
assume that each Z(⃗xi) is a real number.
The naive statistical approach to analysing this type of data is to forget the location that each data
point was drawn from, and work with our data treating it like we have samples of a random variable
Z1, Z2, . . . , Zn. We could then use the tricks we learned in Section 2 to estimate the mean (Equation
1) and the variance (Equation 5) of our random field using this sample. However, if we do it this
way, we are losing a lot of information.
To illustrate this point, in Figure 1, we show six different random fields. All of these random fields
were constructed to have exactly the same mean and variance. Because of this, if you were to take
the standard statistical approach of forgetting about the locations of each pixel, you would not be
able to tell the difference between any of the distributions shown below. However, a quick visual
inspection11 is enough to tell you that all of these random fields are very different. Some are highly
ordered, with similar values of Z(⃗x) being consistently found in nearby spatial locations. Some are
highly chaotic, showing no clear structure at all. Some evolve much more rapidly over shorter spatial
scales than others, whereas others remain correlated over larger distances. When we forget about the
spatial location ⃗xi that each sample of Z is drawn from, we lose all of this information. Throughout
the 20th century, mathematicians from all over the world working in many different fields noticed
9 Named after Karl Pearson, who stole it from Francis Galton, who invented it all by himself – about 45 years after it
had been invented independently by French mathematician Auguste Bravais (Bravais 1844).
10 Note that Pearson’s correlation coefficient only captures linear correlations between variables. This can be a problem
if two variables depend on each other in a nonlinear way. A cute example that you can build at home is the random
variable pair where Y = sin(X), and X ranges from 0 to 20. If you know X, then you can predict Y perfectly, but if
you calculate the Pearson correlation coefficient between X and Y , it will be close to zero. Other kinds of correlation
coefficients have been invented to study non-linear correlations between random variables. See this nice notebook for
an overview.
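The sine example in the footnote above takes only a few lines to check (a sketch; the exact value of the coefficient depends on the range and sampling of X, but it stays small):

```python
import numpy as np

# Y = sin(X) is completely determined by X, yet the *linear* Pearson
# correlation between the two is small.
x = np.linspace(0, 20, 1000)
y = np.sin(x)

r = np.corrcoef(x, y)[0, 1]
print(r)  # small in magnitude, despite the perfect nonlinear dependence
```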
11 This is the formal scientific way of saying “just look at it!”

Figure 1. Six different random fields. All of these fields were constructed to have the same mean and the
same variance, but they look (and are) wildly different. In order to classify and quantify how these fields are
different, we need to consider the spatial aspects of our data. The reader can see how we generated these
fields by taking a look at the corresponding Jupyter Notebook tutorial for this Note.
this, and all came to the same conclusion: we can do better. This led to the invention of a number
of methodologies, all with the same end-goal in mind: to quantify spatial correlations.
3.1. Defining correlation for random fields 12
Looking at the data fields above, we can see that for at least some of them, nearby data points
appear to be correlated. That is, values of Z that are drawn from points that are close to each other
tend to have similar values. In other words, some of these fields seem to obey Tobler’s First Law
of Geography. If we use the methods of estimating mean and variance described in Section 2, we’ll
only be left with two numbers to describe each of these random fields. This is insufficient to describe
the complexities of these subjects. We want to find a way that we can compute correlations between
values of Z(⃗x) in the same random field that come from different spatial locations. Doing this kind
of analysis requires us to know the location (⃗x) that each measurement (Z(⃗x)) is taken from.
The most fundamental way to quantify this relation is to treat every measurement at each location
as being generated by a different, separate random variable. We can then look to see if there are any
12 or: Wait, why are B.M. and S.B. suddenly fighting?

relationships between any pair of measurements. To do this, we define the covariance matrix as the
symmetric, square matrix whose i, j-th element is the covariance between Z(⃗xi) and Z(⃗xj):
\widehat{Cov}(Z) =
\begin{pmatrix}
\widehat{Cov}(Z(\vec{x}_1), Z(\vec{x}_1)) & \widehat{Cov}(Z(\vec{x}_1), Z(\vec{x}_2)) & \cdots & \widehat{Cov}(Z(\vec{x}_1), Z(\vec{x}_n)) \\
\widehat{Cov}(Z(\vec{x}_2), Z(\vec{x}_1)) & \widehat{Cov}(Z(\vec{x}_2), Z(\vec{x}_2)) & \cdots & \widehat{Cov}(Z(\vec{x}_2), Z(\vec{x}_n)) \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{Cov}(Z(\vec{x}_n), Z(\vec{x}_1)) & \widehat{Cov}(Z(\vec{x}_n), Z(\vec{x}_2)) & \cdots & \widehat{Cov}(Z(\vec{x}_n), Z(\vec{x}_n))
\end{pmatrix}    (11)
Once we know the covariance matrix of our random field, we can then divide it by the estimated
variance of our random field (a real number which we can compute using standard statistical methods)
to produce the correlation matrix – the matrix whose i, j-th element is the Pearson correlation
coefficient between Z(⃗xi) and Z(⃗xj):
\hat{\rho}(Z) =
\begin{pmatrix}
\hat{\rho}(Z(\vec{x}_1), Z(\vec{x}_1)) & \hat{\rho}(Z(\vec{x}_1), Z(\vec{x}_2)) & \cdots & \hat{\rho}(Z(\vec{x}_1), Z(\vec{x}_n)) \\
\hat{\rho}(Z(\vec{x}_2), Z(\vec{x}_1)) & \hat{\rho}(Z(\vec{x}_2), Z(\vec{x}_2)) & \cdots & \hat{\rho}(Z(\vec{x}_2), Z(\vec{x}_n)) \\
\vdots & \vdots & \ddots & \vdots \\
\hat{\rho}(Z(\vec{x}_n), Z(\vec{x}_1)) & \hat{\rho}(Z(\vec{x}_n), Z(\vec{x}_2)) & \cdots & \hat{\rho}(Z(\vec{x}_n), Z(\vec{x}_n))
\end{pmatrix}    (12)
This seems useful. However, this construction comes with an irritating caveat. Calculating the
covariance matrix requires multiple measurements of the same locations in our field. When we just
have one measurement, our entire covariance matrix is undefined and cannot be estimated, since we
would need to divide by zero in Equation 7.13
But what if we just had a single measurement of Z(⃗x) at each location ⃗x and we wanted to quantify
spatial correlation? Fortunately, there is a way forward – but it relies on us knowing (or at least
assuming) something about how our random fields behave. Looking at the first four random fields
in Figure 1, we can see that they all have two things in common. Firstly, on large scales, the mean
value of the data does not seem to vary with space – that is, there is no global trend of the
mean value of the data in these fields changing over scales that are the size of the data field, as is
seen in the bottom-right panel of Figure 1. Secondly, the covariance between data at points ⃗x and ⃗y
does not seem to depend on their absolute locations, but only on their separation. In mathematical
language, this is the same as saying that at any two points ⃗x and ⃗y, given any separation vector ⃗r:
$$
\mathrm{E}[Z(\vec{x})] = \mathrm{E}[Z(\vec{y})], \quad \text{and} \quad \mathrm{Cov}[Z(\vec{x} + \vec{r}), Z(\vec{x})] = \mathrm{Cov}[Z(\vec{y} + \vec{r}), Z(\vec{y})]. \quad (13)
$$
There are many different names for this condition. In the arena of signal-processing (where ⃗x is
usually a one-dimensional vector representing time), a random field (or time-varying signal) that
follows these two conditions is called weakly stationary, weak-sense stationary, wide-sense stationary,
or second-order stationary. These terms have been adopted to describe 2+ dimensional data in the
geostatistical literature – we will follow this convention and adopt the term second-order stationary
13 If, instead, we had a distribution of values of Z(⃗xi) at each location ⃗xi, or a way to estimate the distribution of Z(⃗xi)
at each location ⃗xi, we would be fine. In these scenarios, covariance and correlation matrices are perfectly reasonable
things to compute. The Python package numpy contains methods to compute a covariance matrix (numpy.cov) and a
correlation matrix (numpy.corrcoef) from an array of N samples of M random variables.
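To make the footnote concrete, here is a minimal sketch of computing these matrices with numpy when we do have N repeated samples at M fixed locations. The toy data (three locations, two of them deliberately correlated) is our own invention:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: N = 1000 repeated measurements of a random field
# sampled at M = 3 locations, with built-in correlation between locations.
N = 1000
base = rng.normal(size=N)
samples = np.stack([
    base + 0.1 * rng.normal(size=N),   # Z(x1)
    base + 0.5 * rng.normal(size=N),   # Z(x2): correlated with Z(x1)
    rng.normal(size=N),                # Z(x3): independent of the others
])  # shape (M, N): numpy expects one row per variable

cov_matrix = np.cov(samples)        # (3, 3) covariance matrix, Equation 11
corr_matrix = np.corrcoef(samples)  # (3, 3) correlation matrix, Equation 12

print(corr_matrix.round(2))
```

Note that numpy treats each row as one random variable and each column as one sample; if your data is laid out the other way around, pass `rowvar=False` rather than transposing by hand.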

to describe these fields for the rest of this Note. Some people choose to simply call data that follows
this condition stationary, but people will often also use this word for a much stronger condition.14
Another term used for data that follows this condition (for example, in cosmology) is to say that it is
translationally-invariant, or homogeneous – but similar to the word stationary, these words are also
sometimes used to mean a different, stronger condition.15
No matter what you call it, if the random field Z(⃗x) follows this condition (13), then the covariance
between two data points depends only on their separation.
Because of this, we can define the
covariance function C(⃗r) to be the function that gives the covariance between anything separated
by ⃗r:
$$
C(\vec{r}) = \widehat{\mathrm{Cov}}(Z(\vec{x} + \vec{r}), Z(\vec{x})). \quad (14)
$$
This time, the average in the expression for Cov (Equation 7) is computed over all pairs of points
separated by ⃗r (or, in practice, separated by approximately ⃗r). Since there is more than one pair of
points separated by each ⃗r, there is no problem computing this for most values of ⃗r.
Once we have this function, then we can divide it by the variance of our random field to get the
Pearson correlation coefficient between any pair of data points separated by ⃗r. Because this is the
most natural extension of the definition of correlation for random variables, we call this function the
correlation function:
$$
\rho(\vec{r}) = \frac{C(\vec{r})}{\widehat{\mathrm{Var}}[Z(\vec{x})]} = \rho[Z(\vec{x} + \vec{r}), Z(\vec{x})]. \quad (15)
$$
If our random fields are very well-behaved, we can simplify these functions one step further. Return-
ing to Figure 1, we can see that the random fields that we show in our first four panels are isotropic,
or rotationally-invariant: that is, there is no preferred direction along which the data seems to be
varying any more or less than in any other direction (the last two random fields in Figure 1 do
not have this property). If this is the case, then the covariance between two data points that are
separated by ⃗r will depend only on the magnitude of ⃗r. Under this condition, our covariance and
correlation functions simplify to:
$$
C(r) = \widehat{\mathrm{Cov}}(Z(\vec{x}), Z(\vec{y})) \quad (16)
$$
$$
\rho(r) = \rho(Z(\vec{x}), Z(\vec{y})) \quad (17)
$$
where this time, the averages used to calculate the covariance (Equation 7) are taken over all pairs
of points ⃗x and ⃗y for which |⃗x −⃗y| = r.
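In code, Equations 16 and 17 amount to averaging products of mean-subtracted values over all pairs of pixels in each separation bin. Below is a brute-force sketch (the function name, bin width, and toy field are our own choices, and the all-pairs approach is only practical for small fields):

```python
import numpy as np

rng = np.random.default_rng(1)

def covariance_function(field, r_max, dr=1.0):
    """Estimate the isotropic covariance function C(r) of a 2D field
    (Equation 16) by averaging over all pixel pairs in each distance bin."""
    ny, nx = field.shape
    z = (field - field.mean()).ravel()          # mean-subtracted values
    yy, xx = np.mgrid[0:ny, 0:nx]
    ys, xs = yy.ravel().astype(float), xx.ravel().astype(float)
    # All pairwise separations and products (small fields only).
    seps = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :]).ravel()
    prods = (z[:, None] * z[None, :]).ravel()
    edges = np.arange(0.0, r_max + dr, dr)
    C = np.array([prods[(seps >= lo) & (seps < hi)].mean()
                  for lo, hi in zip(edges[:-1], edges[1:])])
    return 0.5 * (edges[:-1] + edges[1:]), C

# Toy field with short-range correlation (our own construction).
field = rng.normal(size=(24, 24))
field = (field + np.roll(field, 1, axis=0) + np.roll(field, 1, axis=1)) / 3

r, C = covariance_function(field, r_max=8.0)
rho = C / field.var()   # correlation function, Equation 17; rho[0] is 1
```

As expected, `rho` starts at 1 in the zero-separation bin and decays towards zero once the separation exceeds the field's correlation length.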
14 Stationary is also used to mean strictly stationary, or strongly stationary.
Under this condition, all higher-order
moments also depend only on the separation between data points and not their positions – but that’s beyond the
scope of this work.
15 In standard cosmologies, the Universe is assumed to be homogeneous in the strong sense on large scales – that is,
all higher order moments are assumed be statistically the same for all pairs, or triples, or quadruples of points that
are separated by the same distances, irrespective of their positions. Just as it does for the power spectrum with
second-order homogeneity, this assumption allows higher-order statistics, such as the bispectrum and the trispectrum
of the cosmic microwave background, to be computed from a single realisation of the data.

Wait, isn’t this the two-point correlation function?
As this is a Note geared towards astronomers and astrophysicists, it would be remiss of us to
not mention the two-point correlation function.a Unfortunately, several different definitions for
this function are used, but the most commonly-used one is this: for a real-valued random field
Z(⃗x), the two-point correlation ξ is defined to be:
$$
\xi(\vec{x}, \vec{y}) = \mathrm{E}[Z(\vec{x})Z(\vec{y})] \quad (18)
$$
If the random field Z(⃗x) is second-order stationary (13 is true for all points ⃗x and ⃗y), then ξ
depends only on the separation between data points:
$$
\xi(\vec{r}) = \mathrm{E}[Z(\vec{x})Z(\vec{y})], \quad \text{where} \quad \vec{r} = \vec{x} - \vec{y}. \quad (19)
$$
And if Z(⃗x) is also isotropic, then ξ depends only on the distance between data points:
$$
\xi(r) = \mathrm{E}[Z(\vec{x})Z(\vec{y})], \quad \text{where} \quad r = |\vec{x} - \vec{y}|. \quad (20)
$$
Despite being called a correlation function, what this function actually measures is something
closer to the covariance. If the random field Z(⃗x) has zero mean, then Equations 19 and 20 are
exactly the same as the covariance function that we define in Equations 14 and 16. In practice,
cosmologists always subtract the means from their fields before they compute the two-point
correlation functions. Provided that this step is performed, the resulting two-point correlation
function of the mean-subtracted random field will be exactly the covariance function of the
original random field.
We warn the reader that it is not uncommon to see ξ(⃗r) defined differently. In Equation 33.2,
Peebles (1980) defines the two-point correlation function (for a second-order stationary, but
not necessarily isotropic random field) as:
$$
\xi(\vec{r}) = \frac{\mathrm{E}\left[(Z(\vec{x} + \vec{r}) - \mathrm{E}[Z(\vec{x} + \vec{r})])\,(Z(\vec{x}) - \mathrm{E}[Z(\vec{x})])\right]}{\mathrm{E}[Z(\vec{x})]^2}. \quad (21)
$$
This is equivalent to the covariance function defined in Equation 14, but it has been normalised
by dividing by the mean of the random field squared instead of the variance. Like the correlation
function (Equation 15), it is unitless; but unlike the correlation function, it is not normalised
to lie between −1 and 1. Furthermore, this function cannot be calculated when E [Z(⃗x)] = 0,
as we would be dividing by zero.
a Don’t worry if you haven’t heard of it. A data scientist with a PhD in statistics had not heard of it,
either. Seemingly, this function is seldom seen outside of astronomy.

Other authors (e.g. Krumholz & Ting 2018) define the two-point correlation function ξ(⃗r) to
be precisely the correlation function as we define it in Equation 15. Others (e.g. Li et al. 2023)
define it as something that is equivalent to Equation 15 if and only if the field in question has
zero mean. In cosmology courses, it is commonly explained as the “excess probability” dP of
finding a galaxy in an infinitesimal volume dV at a distance of r from another galaxy:
$$
dP = n[1 + \xi(r)]\,dV, \quad (22)
$$
which is intuitively the information that is provided by all of the functions defined above. How-
ever, as a mathematical statement, this definition is not consistent with any of the definitions
given above. Because the definition of this function is not universally agreed upon, we prefer to
refer to the covariance function and correlation function that we explicitly define as extensions
of the standard statistical concepts of correlation and covariance throughout the remainder of
this Note.

Wait, isn’t this the autocorrelation function?
Sadly, the answer is that it depends on who you ask.
Lots of different definitions exist for the autocorrelation function. The one thing that almost
everyone agrees on is that it is the cross-correlation of a random field with itself. The cross-
correlation of two random fields (also known as a time series if ⃗x is one-dimensional) Z1(⃗x)
and Z2(⃗x) is sometimes defined to be:
$$
(Z_1 \star Z_2)(\vec{r}) = \mathrm{E}[Z_1(\vec{x})Z_2(\vec{x} + \vec{r})] \quad (23)
$$
In signal processing, the separation ⃗r is often referred to as the lag. This term has also been
adopted by geostatisticians. Here, the average is taken over all possible values of ⃗x for which
both Z1(⃗x) and Z2(⃗x + ⃗r) are known. If this definition is used, then the autocorrelation (the
cross-correlation between Z1(⃗x) and itself) is exactly the two-point correlation function as
defined in Equation 19.
Another definition for the cross-correlation is:
$$
(Z_1 * Z_2)(\vec{r}) = \sum_{\vec{x}} Z_1(\vec{x}) Z_2(\vec{x} + \vec{r}) \quad (24)
$$
If this definition is used, then the cross-correlation between Z1(⃗x) and Z2(⃗x) is exactly equiv-
alent to the convolution of Z1(−⃗x) and Z2(⃗x).a
If we subtract the mean of each random field before taking their cross-correlation, then we get
a function that is often called the cross-covariance function:
$$
K_{12}(\vec{r}) = \mathrm{E}\{(Z_1(\vec{x}) - \mathrm{E}[Z_1(\vec{x})])\,(Z_2(\vec{x} + \vec{r}) - \mathrm{E}[Z_2(\vec{x} + \vec{r})])\} \quad (25)
$$
Letting Z1(⃗x) = Z2(⃗x) = Z(⃗x), we get the auto-covariance function.b This function is exactly
the covariance function that we estimate in Equation 14.
Finally, people often like to take the cross-covariance function and normalise it by dividing by
the standard deviation of each of the random fields. Infuriatingly, this function is also called
the cross-correlation:
$$
R_{12}(\vec{r}) = \frac{\mathrm{E}\{(Z_1(\vec{x}) - \mathrm{E}[Z_1(\vec{x})])\,(Z_2(\vec{x} + \vec{r}) - \mathrm{E}[Z_2(\vec{x} + \vec{r})])\}}{\sqrt{\mathrm{Var}(Z_1(\vec{x}))\,\mathrm{Var}(Z_2(\vec{x}))}}. \quad (26)
$$
If this normalisation is done, then for a second-order stationary random field, the cross-correlation
between Z(⃗x) and itself – called the autocorrelation of Z(⃗x) – is exactly equivalent to the
correlation function that we define in Equation 15.
a If Z1(⃗x) is a complex-valued random field, then you actually need to take its complex conjugate as well as
flipping it. In this case, the cross-correlation between Z1(⃗x) and Z2(⃗x) is equivalent to the convolution of
Z1(−⃗x)∗ and Z2(⃗x), where ∗ represents complex conjugation.
b If things weren’t bad enough already, there is also no consensus on whether these functions are supposed to
be spelled with a hyphen (auto-covariance, auto-correlation) or without one (autocovariance, autocorrelation).

4. STATISTICAL ISLANDS OF SPATIAL VARIATION
Now that we have gathered the tools we need (Equations 14-17), we are ready to set sail and
explore how these concepts are connected to the terminology used on three different “islands of
study”. We define and describe the tools used by geostatisticians (Section 4.1), cosmologists (Section
4.2), and fluid dynamicists (Section 4.3) to quantify the spatial variation that they see in their data.
In each Section, we briefly recount the history of the subfield’s methods, what they describe, the
mathematical formalism, and how they relate to the covariance and correlation functions defined in
Section 3. If the reader is familiar with a particular one of these subfields, it might be a good idea
to start from that Section. Otherwise, these three islands can be visited in any order.
Before we disembark, however, we would like to introduce our travelling companion RaFiel. RaFiel
is a random field. We show a picture of RaFiel in Figure 2. Using Equations 1 and 7, we can
compute the mean of RaFiel to be 0 and their variance to be 1. Looking at RaFiel, they appear
to be second-order stationary and isotropic, so covariance and correlation functions for RaFiel can
be defined in terms of the (scalar) distance r between points – that is, Equations 16 and 17 can be
used to fully describe their second-order structure. As we visit each island, we will describe the same
random field, RaFiel, with methods native to each domain, to highlight how they are similar and
different.
Figure 2. Introducing RaFiel, the example random field who we will be taking with us on our explorations.
We will use methods native to different disciplines to analyse this field in the subsequent sections in order
to determine RaFiel’s second-order spatial structure.

4.1. The Geostatistician’s Approach
The first island we will visit is the land of geostatistics. In the unfortunately colonial way that is all
too often seen in the history of statistics, geostatistics was born out of the African mining industry in
the 1950s, when a French geologist named Georges Matheron decided to take a statistical approach
to prospecting (Agterberg 2004). The original purpose of geostatistics was to give scientific answers
to questions like “Given that we see a high-grade block of ore in one location, and a low-grade block
of ore in a second, where should we dig next if we want to find the best, most mineral-rich mining
locations?”, and “How many core samples need to be dug out before we can understand how the gold
is distributed throughout a gold field?”
Matheron found the classical statistical techniques that were used in his field to be lacking, as
these approaches are not able “to take into account the spatial aspect of the phenomenon, which is
precisely its most important feature” (Matheron 1963). Basing his work partially on the notes of a
South African mining engineer, Danie Krige, Matheron formalised the foundations of geostatistics –
the subfield of mathematics and statistics that is concerned with the analysis of random processes
that vary over continuous spatial domains in a stochastic, yet predictable, way.
Since its inception, the geostatistical approach for spatial data analysis has been put to use in
many diverse fields, including epidemiology, climate modelling, ecology, economics, and of course
geology. The use of geostatistical methodologies applied to astronomical data is an active area of study
(Clark et al. 2019; González-Gaitán et al. 2019; Metha et al. 2021). We include the semivariogram
in this Note for two reasons. Firstly, it requires less mathematical complexity to construct than
the subsequent methods. This makes it easier to compare to the fundamental statistics we describe
in Section 2. The second reason is that geostatistical methods show great promise in comparing
quantitative theories about the Universe to observational data (Metha et al. 2021), estimating values
in incomplete data sets (Metha et al. 2022), and may be helpful for other astronomical applications,
especially with high-resolution data (Metha et al. 2024). In this investigation, we will review one
important tool from the geostatistical approach, the semivariogram, and see what it can tell us about
the correlation structure of our friend RaFiel.
4.1.1. The Semivariogram
The semivariogram is a classical tool from geostatistics that is used for exploratory data analysis.
It is a data visualisation method, whose purpose is to reveal the spatially correlated nature of the
observed data. Intuitively, it shows how the variance between data points depends on their separation.
Examination of a semivariogram plot serves as a test to check if there is spatially (or temporally)
correlated structure in an observation of a random field – and, if it exists, further examination
reveals the amount of variance in the data explainable by spatially correlated effects, and estimates
the spatial scales over which such correlations are effective.
To compute the semivariogram for a second-order stationary (or homogeneous, or translationally-
invariant) random field, take all possible pairs of data points, Z(⃗x) and Z(⃗y), and group them by
their spatial separation ⃗r. We then look at the variance of the difference between each pair of values
that are separated by about ⃗r. Formally, it is defined as:
$$
\gamma(\vec{r}) = \frac{1}{2}\,\widehat{\mathrm{Var}}\left(Z(\vec{x} + \vec{r}) - Z(\vec{x})\right). \quad (27)
$$

If your random field also happens to be isotropic, then the semivariogram only depends on the
distance r = |⃗x −⃗y| between data points.
In this case, we can simplify our expression for the
semivariogram to:
$$
\gamma(r) = \frac{1}{2}\,\widehat{\mathrm{Var}}\left(Z(\vec{x}) - Z(\vec{y})\right), \quad (28)
$$
where the variance is computed over all pairs of points ⃗x and ⃗y for which r−δ/2 ≤|⃗x−⃗y| ≤r+δ/2.16
Here, δ is the bin width of the semivariogram. It is a hyperparameter, meaning that there is no perfect
mathematical way to select it. Really, any value will do, but you need to keep two things in mind:
firstly, the semivariogram won’t be able to tell you about anything that is happening on scales smaller
than δ – so if you are interested in variation on scales of tens of parsecs, choosing δ = 100 parsec is
a bad idea. This is a reason to make your bin size smaller. Secondly, you will need a few pairs of
data points in order to compute the variance in each bin reliably. This is a reason to make your bin
size larger. As long as your bin size is small enough that you can see what you’re interested in, and
large enough that you can be pretty confident in your variance estimates at each spatial separation
(common wisdom amongst statisticians states that we desire at least 30 (and preferably ∼50) data
point pairs at each separation; Schabenberger & Gotway 2005), then you’re good.17
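Putting these pieces together, a bare-bones empirical semivariogram (Equation 28) can be computed by brute force. The function below is our own sketch, not a standard library routine; for large real data sets one would want an optimised implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def semivariogram(field, r_max, delta=1.0):
    """Empirical semivariogram (Equation 28) of a 2D field, binning all
    pixel pairs by separation; delta is the bin width discussed above."""
    ny, nx = field.shape
    vals = field.ravel()
    yy, xx = np.mgrid[0:ny, 0:nx]
    ys, xs = yy.ravel().astype(float), xx.ravel().astype(float)
    seps = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :]).ravel()
    # For a stationary field the differences have zero mean, so the variance
    # in Equation 28 reduces to the mean squared difference over each bin.
    sqdiff = ((vals[:, None] - vals[None, :]) ** 2).ravel()
    edges = np.arange(0.0, r_max + delta, delta)
    gamma = np.array([0.5 * sqdiff[(seps > lo) & (seps <= hi)].mean()
                      for lo, hi in zip(edges[:-1], edges[1:])])
    return 0.5 * (edges[:-1] + edges[1:]), gamma

# Toy field with short-range correlation: gamma(r) rises towards the sill.
field = rng.normal(size=(24, 24))
field = (field + np.roll(field, 1, axis=0) + np.roll(field, 1, axis=1)) / 3

r, gamma = semivariogram(field, r_max=10.0)
```

The all-pairs arrays grow quadratically with the number of pixels, which is exactly why the bin width and the maximum separation are worth choosing deliberately, as discussed above.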
The best way to learn about what a semivariogram can tell us is to look at one.
We show a
semivariogram of RaFiel in Figure 3. Because RaFiel is isotropic, we can use the one-dimensional
form of the semivariogram that we define in Equation 28.18 From this Figure, we can see that the
semivariogram, γ(r), tends to increase as r increases. This tells us that RaFiel obeys Tobler’s First
Law of Geography. For data points that are closer to each other (lower r), there is less variance in
the difference between their values – in other words, points that are close to each other tend to have
more similar values than points that are farther away.
16 This function, defined over bins of separations r, is sometimes referred to as the empirical semivariogram to distinguish
it from the theoretical semivariogram, which uses the exact variance (not an estimator), has no binning, and is
impossible to compute from observational data. Since we’re in the business of making sense of the world around
us and not of defining abstract mathematical functions, we will hereafter use the term semivariogram to mean the
empirical semivariogram. Both the empirical and theoretical semivariograms are also, confusingly, often referred to as
variograms – we warned you that the sea of definitions was treacherous!
17 But how can I measure how confident I ought to be about my variance estimates at each location? Well, to do that, you
need to estimate the uncertainty on your variance estimates at each location. That is, you need to find the variance
of your variance estimate. Formally, this can be done – and there’s nothing wrong with doing this – but people will
make jokes about you on the internet for it, the average elevation of eyebrows in your local area will rise by several
millimetres, and nobody on the statistics stack exchange will take your questions seriously. Most people don’t bother
with this, so don’t worry about it.
18 Be sure to check out the Jupyter notebook to see how this is done in practice. Let the soft hum of your laptop’s
computer fan be the peaceful white noise that soothes you as you read through the rest of this Section.

Figure 3. This is a semivariogram for RaFiel, our random field. The semivariogram increases for points
that are separated by larger distances, because points that are closer to each other tend to have more similar
values. This behaviour reflects what we see in the image of RaFiel (Figure 2). At a separation of r ∼20
pixels, the semivariogram tends to reach a near-constant value. This tells us at what distance points stop
being correlated with each other (called the range in the geostatistical literature).
At large separations (r ≳ 40 pixels), this semivariogram seems to flick up towards higher values
before becoming “wobbly”. This effect is commonly seen in semivariograms. As we go to further
separations, the amount of pairs available for each variance estimate decreases, and so the variability
in the semivariogram increases. It is nothing to worry about – all of the information that we are
interested in happens on spatial scales much smaller than the size of our box, before the semivariogram
becomes unreliable. Unfortunately, there is no exact formula for the separation at which these edge
effects kick in – but a common recommendation is to only compute the semivariogram for separations
of up to 1/2 of the maximum separation in the data (Schabenberger & Gotway 2005). On the other end of
things, the smallest variations that a semivariogram analysis can pick up in theory are variations of
sizes equal to the minimum separation between observations – i.e. the size of an individual pixel. In
practice, the size of a fluctuation must be greater than about twice the size of a pixel to be accurately
captured using a geostatistical analysis framework (Metha et al. 2024).

By looking more closely at the semivariogram, we can read three important parameters that give
us a broad overview of the way that this data field is spatially correlated:
• The range is the separation between data points at which the variance appears to stop increas-
ing. There is no precise mathematical way that this is calculated – instead, we just look at the
data, and see where, approximately, it starts to get flat. This tells us where the largest spatial
variations in the data can be found. At spatial locations separated by more than the range, for
practical purposes, data points will be uncorrelated.
• The sill is the maximum height that the semivariogram reaches, or the height that it flattens
out to at its range. This value is equal to the total variance in the data, as would be computed
from Equation 5.
• The nugget is the inferred height of the semivariogram as r →0. This quaint term gets its
origins from the gold fields where geostatistics was first hit upon – the size of a nugget of gold
was much smaller than the size of the rock samples that were analysed, so the presence or
absence of these nuggets would cause large variations in the amount of gold found between
samples, no matter how close they were to each other (Chiles & Delfiner 1999). This is an
example of a microscale variation – a variation that happens on scales smaller than the size of
a single sample of the data. We cannot see this variation with a geostatistical approach, as we
cannot make out any variations smaller than a single pixel.
In addition to being a handy tool for visualising the spatial structure of a random data field,
the semivariogram also satisfies several nice mathematical properties. For second-order stationary
random fields (that is, random fields for which Condition 13 is true for all pairs of points), the
following relationship holds:
$$
\gamma(\vec{r}) = C(0) - C(\vec{r}), \quad (29)
$$
where C(⃗r) is the covariance function defined in Equation 14. If we want to, we can manipulate this
expression to relate the semivariogram to the correlation function, ρ(⃗r), and the total variance, σ2,
of the data:
$$
\gamma(\vec{r}) = \sigma^2\,(1 - \rho(\vec{r})). \quad (30)
$$
So the semivariogram is just a manipulated version of the covariance function. It is related to the
correlation function, but captures one extra piece of information: the total variance present within
the random field. We will see how this function relates to other approaches used in astronomy in
subsequent Sections.
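Equation 30 is also easy to verify numerically on a toy 1D field (the smoothing kernel and the lags below are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(4)

# A stationary, correlated 1D toy field: white noise smoothed with a boxcar.
z = np.convolve(rng.normal(size=5000), np.ones(8) / 8, mode="valid")
sigma2 = z.var()

for lag in (1, 3, 6):
    a, b = z[:-lag], z[lag:]
    gamma = 0.5 * np.var(a - b)           # semivariogram at this lag, Eq. 27
    rho = np.corrcoef(a, b)[0, 1]         # correlation function at this lag
    # gamma and sigma^2 * (1 - rho) agree up to finite-sample noise.
    print(lag, gamma, sigma2 * (1 - rho))
```

The agreement follows directly from expanding the variance of the difference: Var(Z(x) − Z(y)) = 2 Var(Z) − 2 Cov(Z(x), Z(y)) for a second-order stationary field.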
4.2. The Cosmologist’s Approach
The Fourier approach to quantifying spatial variations within cosmology and large scale structure
studies arose through the need to learn about structure on large scales with minimal information.
Modern cosmology began with Albert Einstein’s general theory of relativity (Einstein 1915) and
Edwin Hubble’s discovery of the expanding Universe (Hubble 1929). Both of these theories make
predictions about the relationship between and evolution of structures in our Universe on various
distance scales. To truly understand our universe on all scales, we need to study how often different
scales appear within our universe – that is, we need to know how correlated structures in our Universe
are.

In the mid 1900s, cosmologists discovered that the distribution of galaxies on the sky was not
uniform. Instead, galaxies were seen to cluster. If one patch of the Universe was seen to be rich in
galaxies, then nearby patches were found to be more likely to also be rich in galaxies (Hubble 1934;
Bok 1934; Mowbray 1938). In other words, on intergalactic scales, the Universe was seen to obey
Tobler’s First Law of Geography.
Zooming out even further, George Abell (1958) predicted that galaxy clusters themselves exist
within superclusters – clusters of galaxy clusters. This was accomplished by assuming all clusters were
distributed randomly with no correlation and calculating the probability (using classical statistical
methods) of observing the population counts of galaxies that he found in his data. Abell found that
the probability of the data having no spatial structure was vanishingly small, but he still had no way
to quantify what the actual spatial structure was.19
Disappointed by the lack of good spatial statistical methods in their field, two cosmologists, grad-
uate student Jer Tsang Yu and his supervisor James Peebles, came up with a clever way to quantify
information on various spatial scales (Yu & Peebles 1969). Drawing inspiration from the concept of
power spectral density that was recently developed for signal processing applications (Blackman &
Tukey 1958), Yu and Peebles were the first to quantify clustering length scales with the
power spectrum.
There now exists no cosmology textbook in the Universe that fails to mention the power spectrum
both as a function of spatial modes (or Fourier modes, k, defined in Section 4.2.1) and angular
modes (or multipole moments, l)20. The latter enables us to study the cosmic microwave background
(CMB)21, while the former most commonly allows for study of the evolution of large scale structures
in our Universe. Both of these tools have been used to develop insight into the evolution of our
Universe in synergistic ways.
19 This kind of calculation is called a p-value, where you assume that something you think is wrong (a null hypothesis) is
true, and then calculate how likely data like yours would be under that assumption. If you find that your data would be
very unlikely if the thing you think is wrong were actually right (a low p-value), then you can be pretty sure that the
thing you think is wrong really is wrong (you can reject the null hypothesis). Trouble is, this approach won’t tell you
what the right thing to believe instead is.
20 Often, cosmologists refer to the power spectrum as a two-point statistic, meaning it uses two points at a time to
quantify correlations. The trispectrum (See Glossary) is an example of a higher order statistic.
21 In 1965, two radio astronomers working at Bell telephone laboratories, Arno Penzias and Robert Wilson, were vexed
by a mysterious, constant microwave signal in their data. After evicting a family of pigeons that were living in their
satellite dish and cleaning out their droppings, the signal remained. Fortunately, Penzias and Wilson were in contact
with physicists from MIT who could identify what this signal really was: a signature of leftover radiation from moments
after the Big Bang, when the Universe was incredibly hot and dense. Dubbed the cosmic microwave background (or
CMB for short), this signal tells us a lot of information about what the Universe was like only 380,000 years after the
Big Bang (Penzias & Wilson 1965).

4.2.1. The k-mode Power Spectrum
In order to define what a power spectrum is, we must first introduce the Fourier transform – a way
to think about signals as the sums of many different, independent waves. Fourier transforms were
invented by the French mathematician and physicist Jean-Baptiste Joseph Fourier.22 In a nutshell,
Fourier’s theorem is this: any signal, no matter how complicated its shape, as long as it is reasonably
well-behaved23, can be expressed as the sum of many different waves. The Fourier transform, F,
takes a signal that is a function of ⃗r as an input, and figures out what combination of waves are
needed to reproduce that signal. We describe these waves in terms of their wavenumber, ⃗k. A wave
with a wavenumber of ⃗k oscillates along the direction of ⃗k, and will undergo one full cycle over a
distance of 2π
|⃗k|. The term k-mode is used to refer to a wave with a wavenumber of magnitude k.
The Fourier transform of a function f(⃗r) is written with a squiggly hat, ˜f(⃗k). The value of ˜f at each
wavenumber ⃗k tells you how much of that particular wave needs to be added in order to reproduce
your spatial signal, f(⃗r). We say that the function ˜f(⃗k) exists in Fourier space, or k-space, and the
function f(⃗r) exists in real space, or configuration space – but really, they are the same function.
Both f(⃗r) and ˜f(⃗k) contain the same information. This is essentially the same as the two functions
being expressed in two different bases.
To make these concepts more concrete, it helps to have an example that we are familiar with. In
classical music, orchestras all tune up by playing the same note: the A above middle C. To make
sure everyone is always playing the same note, this pitch is defined to have a frequency of exactly
440Hz.24 Taking a Fourier transform of this signal, we would see that ˜f(⃗k) has a large spike at a
value of ⃗k = +440 Hz, and is zero everywhere else – this signal can be made up of one wave alone.
If we were instead to play this concert A on a piano and feed the time series of the sound into a
spectrometer (a machine that lets us visualise amplitude versus frequency as a function of time), we
would see some amount of power at the harmonic frequencies of integer multiples of 440 Hz (at 880
Hz, 1320 Hz, 1760 Hz, and so on). If we repeated this experiment with a flute, or a trumpet, or a
cello, we would see that each of these different harmonics contributes a different amount to the final
signal. By looking at the spectral signatures revealed by taking the Fourier transform of our signals,
we could clearly distinguish between which instrument is being played.
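The concert-A example can be reproduced in a few lines of numpy (the sampling rate below is an arbitrary choice of ours): the discrete Fourier transform of a pure 440 Hz sine wave has a single spike at 440 Hz, plus its mirror image at −440 Hz.

```python
import numpy as np

fs = 4096                                 # sampling rate in Hz (assumed)
t = np.arange(fs) / fs                    # one second of time samples
signal = np.sin(2 * np.pi * 440 * t)      # a pure concert-pitch A

spectrum = np.abs(np.fft.fft(signal))     # magnitude of each Fourier mode
freqs = np.fft.fftfreq(len(signal), d=1 / fs)

peak = freqs[np.argmax(spectrum)]         # frequency of the largest mode
print(abs(peak))                          # 440.0
```

Adding harmonics at integer multiples of 440 Hz to `signal` would reproduce the richer spectral signatures of the piano, flute, and cello described above.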
To put this concept into precise mathematical language, we define the Fourier transform, F, for a
3-dimensional function f(⃗r) as follows:25
$$
\tilde{f}(\vec{k}) = \mathcal{F}\{f(\vec{r})\} = \int_{-\infty}^{\infty} e^{-i\vec{k}\cdot\vec{r}}\, f(\vec{r})\, d^3r. \quad (31)
$$
The Fourier transform is an information-preserving operation – that is to say, both f(⃗r) and ˜f(⃗k)
tell us everything that there is to know about a signal. Because of this, we can also use the Fourier
transform of a signal, ˜f(⃗k), to figure out what the original signal was. This is done through an
22 J. J. Fourier was a bit of a character. In addition to inventing the Fourier transform, he was also a governor of Lower
Egypt in Napoleon’s army, is credited with discovering the greenhouse effect, and enjoyed wrapping himself in a warm
blanket and walking around his mansion in it. This last hobby led to his tragic death in 1830, when, wrapped in a
blanket, he fell down his stairs and was unable to break his fall (Cox & Forshaw 2012). At least he died doing what
he loved.
23 Determining whether a Fourier transform exists for a signal is actually quite complex, and depends on the integrability
of the function. We direct the reader to Champeney (1987) for a more rigorous explanation of the conditions required
for a Fourier transformation to be performed.
24 You can hear this pure, 440Hz, concert pitch A tone here.
25 We stick with the cosmologist’s way of expressing the Fourier transform, but other fields use different conventions. For example, you might see the negative sign in the exponent of the inverse Fourier transform, or different placements of the factors of 2π. A summary of common conventions can be found in this PDF; of these, we use the “standard layout”.

operation called the inverse Fourier transform, which can be calculated with the following formula:
$$f(\vec{r}) = \mathcal{F}^{-1}\{\tilde{f}(\vec{k})\} = \int_{-\infty}^{\infty} \frac{d^3k}{(2\pi)^3}\, e^{i\vec{k}\cdot\vec{r}}\, \tilde{f}(\vec{k}). \qquad (32)$$
Interestingly, the formula for taking the inverse Fourier transform of a function in k-space (Equation
32) looks very similar to the formula that we use to take a Fourier transform of a function in real
space (Equation 31). With the definition of the Fourier transform commonly used in cosmology, the
key things to be careful about are the extra factors of 2π in the denominator of the inverse Fourier
transform, and the change to a positive sign in the exponent going from Equation 31 to 32. To
probe the spatial analog to musical frequency in Fourier space, we use k-modes, which correspond to a length scale of $2\pi/|\vec{k}|$. Essentially, the Fourier transform takes our data in terms of its actual value at each position and converts it into an amplitude at each of the different scales in our data.
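Numerically, both transforms are computed with the fast Fourier transform. The following minimal numpy sketch (the 16³ grid and random data are purely illustrative) demonstrates the information-preserving roundtrip; note that numpy's discrete convention places its factors of 2π differently from Equations 31 and 32, but the roundtrip property is convention-independent:

```python
import numpy as np

rng = np.random.default_rng(42)
f = rng.normal(size=(16, 16, 16))   # a toy 3D "signal" sampled on a grid

f_k = np.fft.fftn(f)                # discrete analog of Equation 31
f_back = np.fft.ifftn(f_k)          # discrete analog of Equation 32

# the transform is information-preserving: we recover f (up to float error)
print(np.allclose(f, f_back.real))  # True
```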
We define the power spectrum of a second-order stationary (or homogeneous, or translationally-
invariant) random field as the Fourier transform of the covariance function of that field:26
$$P(\vec{k}) \equiv \int_{-\infty}^{\infty} d^n r\, e^{-i\vec{k}\cdot\vec{r}}\, C(\vec{r}), \qquad (33)$$
where n is the number of dimensions of our random field. For an isotropic (statistically symmetric)
1D random field (i.e. time series data), this equation simplifies to:
$$P(k) = 2\int_{0}^{\infty} C(r)\cos(kr)\, dr. \qquad (34)$$
In this case, the resulting power spectrum is often called the power spectral density (or some permu-
tation of those three words – see the Glossary).
For an isotropic, 2D field (such as RaFiel), we can obtain the power spectrum from an integral over one spatial dimension as well:
$$P(k) = 2\pi \int_{0}^{\infty} C(r)\, J_0(kr)\, r\, dr, \qquad (35)$$
where J0(x) is the zeroth-order Bessel function of the first kind. If we instead consider a 3D field
that is also isotropic, then the power spectrum simplifies to:
$$P(k) = 4\pi \int_{0}^{\infty} C(r)\, \frac{\sin(kr)}{kr}\, r^2\, dr. \qquad (36)$$
We skip the derivation, but you can find it here. Equation 36 is the k-mode power spectrum for
a second order stationary, isotropic field.
We’re still just one Fourier transform away from the
covariance function we defined in Equation 14 in Section 2. Because the Fourier transform is an
invertible transform, the power spectrum contains exactly the same information as the covariance
function does – if the power spectrum of a random field is known, then its covariance function can be
26 If you read any cosmology textbook on the planet, you will probably see the power spectrum instead defined as the
Fourier transform of the two-point correlation function, ξ(⃗r). This is exactly equivalent to what we are doing
here. We choose to write it in this way for two reasons. Firstly, the two-point correlation function actually measures
covariance for a zero-mean field, not correlation as defined in Equation 9. Secondly, nobody outside of cosmology knows
what a two-point correlation function is. If you are a cosmologist who wants to be able to explain your methodology
to people outside of astronomy, this is the least confusing way to do it that we could think of.

reconstructed. We can go one step further, and convert the covariance function (the inverse Fourier transform of the power spectrum) to the semivariogram using Equation 29. Since this equation is also invertible, these two equations
form a bridge that allows the methodology used to capture second-order structure by cosmologists
to be translated into the methodology used by geostatisticians: both approaches capture the exact
same information about the random fields under investigation.
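As a concrete check of Equation 36, we can integrate it numerically for a covariance function whose power spectrum is known in closed form. Here we pick, as a purely illustrative example, the exponential covariance $C(r) = \sigma^2 e^{-r/L}$, whose 3D isotropic Fourier transform is $P(k) = 8\pi\sigma^2 L^3/(1 + k^2L^2)^2$:

```python
import numpy as np

L_corr, sigma2, k = 1.0, 1.0, 2.0      # illustrative parameter choices
r = np.linspace(0.0, 60.0, 600_001)    # integration grid (C(r) ~ 0 by r = 60)
dr = r[1] - r[0]

C = sigma2 * np.exp(-r / L_corr)       # exponential covariance function
# np.sinc(x) = sin(pi x)/(pi x), so this evaluates sin(kr)/(kr) safely at r = 0
integrand = 4.0 * np.pi * C * np.sinc(k * r / np.pi) * r**2
P_numeric = np.sum(integrand) * dr     # Equation 36, as a Riemann sum

P_analytic = 8.0 * np.pi * sigma2 * L_corr**3 / (1.0 + (k * L_corr) ** 2) ** 2
print(P_numeric, P_analytic)           # both ≈ 1.005
```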
In the ideal case, when we move our quantification of spatial correlations into Fourier space, we are
only capturing variances of k-modes, i.e., we are diagonalising the covariance matrix. When we look
at the evolution of the matter distribution in our Universe to first order through the lens of the power
spectrum, only the amplitude of the power spectrum changes as a function of time. The k-modes in
the matter power spectrum of our Universe evolve independently. This convenience exemplifies the
utility of the power spectrum within cosmology.
Wait, how does the power spectrum diagonalise the covariance matrix in Fourier
space?
We know that the power spectrum is the Fourier transform of the covariance function of our
data (or the two-point correlation function, or the autocorrelation function, depending on who
you talk to – see the red Wait! boxes in Section 3), but it also presents interesting consequences
in Fourier space by effectively removing covariance in k-space. We can make this connection
explicit with the following Equation, which holds true for second-order stationary random fields
(Z(⃗x) for which Condition 13 holds true for all pairs of points ⃗x and ⃗y):
$${\rm Cov}\!\left(\tilde{Z}(\vec{k} + \vec{\Delta k}),\, \tilde{Z}(\vec{k})^*\right) = (2\pi)^D\, \delta(\vec{\Delta k})\, P(\vec{k}), \qquad (37)$$
where $\delta(\vec{\Delta k})$ is the Dirac delta function, which is equal to 0 everywhere except where the difference between k-modes $\vec{\Delta k} = 0$, and D is the dimensionality of our data. This shows that when we move to Fourier space, the k-space representation of its covariance matrix is diagonal. Any off-diagonal elements, i.e. elements for which the difference between k-modes $\vec{\Delta k} \neq 0$, will have covariances of zero in Fourier space. That is, there will be no correlation
between different k-modes in Fourier space. Equation (37) comes from the fact that our data
is translationally invariant, and the nature of the Fourier transform itself. If the random field
Z(⃗x) was not statistically translationally invariant to second-order, then there would still be
correlations between different k-modes in Fourier space. For further details and a proof of
this, see Section 4 of this Note.
Also, in real observations, we’re never actually perfectly
diagonalising covariance matrices due to the effects of foregrounds, galactic extinction, survey
geometry, and instrumental effects, which can be collectively encapsulated in functions termed
window functions (Liu & Tegmark 2011; Karim et al. 2023).
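We can see Equation 37 at work numerically. The sketch below (with an arbitrary, illustrative choice of spectrum) draws many realisations of a stationary 1D random field by colouring white noise in Fourier space, then checks that the sample covariance between two different k-modes is tiny compared to the variance of a single mode:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_real = 128, 4000
k = np.fft.fftfreq(n) * 2 * np.pi
P = 1.0 / (1.0 + k**2)                 # illustrative power spectrum

# stationary realisations: colour white noise by sqrt(P) in Fourier space
white = np.fft.fft(rng.normal(size=(n_real, n)), axis=1)
Z = np.fft.ifft(white * np.sqrt(P), axis=1).real

Zk = np.fft.fft(Z, axis=1)
off_diag = np.mean(Zk[:, 5] * np.conj(Zk[:, 9]))   # two different k-modes
diag = np.mean(np.abs(Zk[:, 5]) ** 2)              # variance of one k-mode

print(abs(off_diag) / diag)   # consistent with zero (sampling noise only)
```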
The computational implementation and mathematical formalism for calculating a power spectrum may not seem equivalent at first glance, but Equation 37 provides the first glimpse into its numerical
calculation. Constructing the 3D power spectrum numerically requires summing up power in spherical
shells in Fourier space. We show how this is implemented computationally in our Jupyter notebook
tutorial. In addition to the convenience of describing the matter evolution of our Universe through the
matter power spectrum, Fast Fourier Transforms (FFTs) allow us to compute the Fourier transform
of our fields and thus the power spectrum extremely efficiently.
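As a rough illustration (our own simplified sketch, not the notebook implementation itself), a radially binned 2D power spectrum can be computed along these lines:

```python
import numpy as np

def power_spectrum_2d(field, nbins=20):
    """Radially binned power spectrum of a square 2D field (simplified sketch)."""
    n = field.shape[0]
    power = np.abs(np.fft.fftn(field)) ** 2 / field.size   # power per k-mode
    kx = np.fft.fftfreq(n) * 2 * np.pi
    kmag = np.sqrt(kx[:, None] ** 2 + kx[None, :] ** 2)    # |k| of each mode
    bins = np.linspace(0.0, kmag.max(), nbins + 1)
    which = np.digitize(kmag.ravel(), bins)
    # average the power over annuli (circular bins) in Fourier space
    pk = np.array([power.ravel()[which == i].mean() for i in range(1, nbins + 1)])
    return 0.5 * (bins[1:] + bins[:-1]), pk
```

For a unit-variance white-noise field, this returns an approximately flat spectrum of height one, as expected for a field with no spatial correlations.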

In a real world computation of the power spectrum, there are limits to the minimum and maximum
spatial scales that we can probe. Consider the Fourier transformed version of our field in k-space.
Although we could in theory calculate infinitesimally small k-modes, the smallest k-mode that we can learn about corresponds to the size of the field of view of our data, $L_{\rm max}$ – that is, $k_{\rm min} = 2\pi/L_{\rm max}$. At this spatial frequency (i.e. k-mode), each oscillation covers exactly one pixel in Fourier space. Decreasing the frequency of these waves would create multiple oscillations over the same Fourier space pixel
and does not provide any new information. On the small-scale (real space) or large-k-mode (Fourier space) end, we could also theoretically take larger and larger k-modes. When computing the power spectrum, our maximum useful k-mode is $k_{\rm max} = \frac{2\pi}{\sqrt{n}\,\times\,{\rm pixel\ size}}$, where n is the number of dimensions of our field. For example, in 2D, the largest wave we can form in our square box in Fourier space will have a magnitude of $k = \sqrt{k_x^2 + k_y^2}$, which is $k = \sqrt{2}\,k_x$ (taking $k_x = k_y$) because of isotropy. In the actual computation of a 2D power spectrum, this is the same as putting an upper limit on the size of the circle in Fourier space in the circular binning that we use.
A power spectrum of RaFiel can be seen in Figure 4, where we show the power spectrum as a function of k-magnitude (in units of 2π/pixels). After learning about semivariograms, the x-axis of Figure 4 may seem backwards to your intuition: the largest spatial scales correspond to the k-modes closest to zero, and the smallest spatial scales are the largest k-modes. We see more noise in the power spectrum at the largest k-modes, as it gets harder and harder to probe smaller scales as we reach down towards our resolution limit.
4.3. The Fluid Dynamicist’s Approach
The third island in our whirlwind tour of spatial data analysis methodologies is the world of fluid
dynamics, where random fields are usually a result of turbulence.
Turbulent flows are spatially and temporally stochastic. They are uneven, unstable and unpre-
dictable, with large, irregular variations in the fluid velocity appearing over a wide range of spatial
scales. Despite these difficulties, understanding turbulence is important in science and engineering
for a lot of reasons – for example, to make accurate predictions of the weather, to understand the
mechanisms that quench star formation in the interstellar media of galaxies, and to make sure that
our aeroplanes don’t fall out of the sky.
While the equations that govern turbulent fluids are well-known and easy to write down, trying to
solve them (or even knowing whether they can be solved) is a million dollar question.27 To make progress on this problem, two steps had to be taken. The first was to consider the statistical properties of
the pressure, velocity, and temperature of the resulting turbulent field, rather than trying to model
how they evolve explicitly. This is what we have been doing with all of the random fields that we
have encountered so far already, so this step is not new to us. The second step was made by British
meteorologist Lewis Fry Richardson28, who (to the best of our knowledge) came up with the theory
of the turbulent energy cascade (Richardson 1922). In this theory, turbulence begins with large-
scale eddies, which are unstable, and break up into smaller eddies, which are unstable, and break
up into smaller eddies on smaller scales still. This process stops when the eddies become so small
27 We’re not kidding. At the turn of our millennium, the Clay Mathematics Institute of Colorado put a million dollar
bounty on seven mathematical problems that they wanted to see solved in the next thousand years. Finding out if
smooth solutions always exist to the Navier-Stokes equation, which describes how turbulent flows evolve with time, is
one of these problems.
28 In addition to being a meteorologist, Richardson was a pacifist, in a very mathematical way. After developing the
mathematical technology that is used to predict the weather, Richardson attempted to use that same maths to model
how wars start, in order to figure out how to prevent them. While he was not completely successful in preventing all
future wars, he did end up writing a very interesting book on the topic (Richardson 1960).

Figure 4. This is the power spectrum for RaFiel, our random field. We express the k magnitudes following the cosmological Fourier transform conventions. The k magnitude units are 2π/pixels. Most of our power is at small k-modes (large spatial scales). This makes sense, as we can see significant large scale structure in RaFiel extending over many pixels (Figure 2). The largest k-mode that we show represents the smallest scale that you can probe in your field, which is $\sqrt{2}$ pixels (Section 4.2.1). We can only see correlations that are larger than that size.
that the random motion of particles becomes comparable to the sizes of the eddies, at which point
the turbulent kinetic energy diffuses into thermal energy. This theory is summarised in Richardson’s
famous little poem: Big whirls have little whirls that feed on their velocity, and little whirls have
lesser whirls, and so on to viscosity (Richardson 1922).
The key takeaway from Richardson’s theory is this: the scale over which these turbulent fluctuations
occur is important. This paved the way for two Soviet mathematicians, Andrey Kolmogorov and his
collaborator Alexander Obukhov, to explore how the variation in the velocity of a turbulent medium
changes over different spatial scales (Yaglom 1990). In 1941, both of these mathematicians delivered
groundbreaking results on how turbulence is expected to behave in a statistical sense over a range
of spatial scales. However, in a historical twist that will shock absolutely no one by this point, both
mathematicians chose to present their results using different tools and different terminology. Obukhov
(1941) presented his results on the way that turbulence is structured in Fourier space, using a tool
called the energy spectrum that is exactly the same as the power spectrum that was explained
in Section 4.2.1. On the other hand, Kolmogorov (1941) stayed in real space, and computed the
spatial correlation structure of turbulent velocities using a tool that is now known as the structure
function. To this day, both structure functions and energy spectra are used in turbulence analysis.

4.3.1. The Structure Functions
The p-th order structure function of a scalar-valued random field Z(⃗x) at a separation of ⃗r is defined
to be the average absolute difference between values of the random field at points separated by ⃗r
raised to the power of p, where p is a positive integer:
$$S_p(\vec{r}) = E\left[\,|Z(\vec{x} + \vec{r}) - Z(\vec{x})|^p\,\right]. \qquad (38)$$
In other words, it is the p-th moment of the absolute difference between Z(⃗x + ⃗r) and Z(⃗x).29 In
the case where Z(⃗x) is isotropic, then this value will only depend on the magnitude of ⃗r and not its
direction, and so we can define the structure function as:
$$S_p(r) = E\left[\,|Z(\vec{x}) - Z(\vec{y})|^p\,\right], \quad \text{where } |\vec{x} - \vec{y}| = r. \qquad (39)$$
The third, fourth, and higher order structure functions tell you about the third, fourth, and higher
moments of your random field – but since we are only focusing on second-order spatial structures of
random fields in this Note, we limit our discussion to the structure functions of the first and second
order and their comparison to other two-point statistics: the semivariogram (Section 4.1) and the power
spectrum (Section 4.2).
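For a regularly sampled 1D field, these definitions translate directly into a few lines of numpy (a simple pair-counting sketch of Equation 38; the function and variable names are our own):

```python
import numpy as np

def structure_function(z, p, lags):
    """p-th order structure function of a regularly sampled 1D field z,
    estimated at the given integer pixel lags (Equation 38)."""
    return np.array([np.mean(np.abs(z[lag:] - z[:-lag]) ** p) for lag in lags])

# example: for a Brownian-motion-like signal, S2(r) grows linearly with r
z = np.cumsum(np.random.default_rng(0).normal(size=100_000))
s2 = structure_function(z, p=2, lags=[1, 4, 16])
print(s2[1] / s2[0], s2[2] / s2[1])   # both ≈ 4
```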
The first-order structure function seeks to achieve the same purpose as the semivariogram and
power spectrum – to quantify the spatially-correlated nature of variability within the data as a
function of scale. However, unlike the semivariogram, which looks for the variance between pixels at
a given separation, the first-order structure function is based around a different measure of spread –
namely, the mean absolute deviation between pairs of values. In spirit, this measurement is similar
to a standard deviation up to a small correction factor – for Gaussian distributions, the size of the
mean absolute deviation between pairs is $\frac{2}{\sqrt{\pi}}\sigma$, or approximately 1.13σ. However, the first-order
structure function is not commonly used to quantify spatial correlations, as its results cannot be
readily converted into statements about variances, standard deviations, or correlations (i.e.
the
useful statistics described in Section 3) for random fields in general.
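The $\frac{2}{\sqrt{\pi}}\sigma \approx 1.13\sigma$ factor quoted above is easy to verify numerically for Gaussian draws:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1_000_000)   # pairs of standard normal draws
y = rng.normal(0.0, 1.0, size=1_000_000)

mad_pairs = np.mean(np.abs(x - y))         # mean absolute deviation of pairs
print(mad_pairs, 2.0 / np.sqrt(np.pi))     # both ≈ 1.128
```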
The second-order structure function instead tells us about how the covariance between data points
Z(⃗x + ⃗r) and Z(⃗x) depends on their separation, ⃗r. If this sounds similar to a semivariogram, that’s
because it is: for second-order stationary data or data with zero mean, the second-order structure
function is related to the semivariogram by the following equation:
$$S_2(\vec{r}) = 2\gamma(\vec{r}). \qquad (40)$$
Combining this insight with Equation 29, we can relate the second-order structure function to the
covariance function which is defined for second-order stationary random fields in Equation 14:
$$S_2(\vec{r}) = 2\left(C(0) - C(\vec{r})\right). \qquad (41)$$
Using this, we can in turn express the second-order structure function in terms of the correlation
function, ρ(⃗r), that we define for second-order stationary random fields in Equation 15 and the total
variance σ2:
$$S_2(\vec{r}) = 2\sigma^2\left(1 - \rho(\vec{r})\right). \qquad (42)$$
29 There are other ways to compute these structure functions – for example, rather than using pairs of points at two
locations (⃗x + ⃗r and ⃗x), you could also use a triple of three points (⃗x + ⃗r, ⃗x, and ⃗x −⃗r) to compute the structure
functions of any order. For details on why you would want to do it this way and equations for how this is done, see
Seta et al. (2023).

So the second-order structure function is also just a manipulated version of the covariance function.
This shows us that the second-order structure function gives exactly the same information that a
semivariogram does! Because we already showed how power spectra and semivariograms contain
the same information, this means that the second-order structure function also gives the exact same
information that a power spectrum (or an energy spectrum as it’s called in the turbulence literature)
does.
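Equation 41 can be checked directly on a synthetic stationary field (our own sketch, with an arbitrary illustrative spectrum): estimate $S_2$ by averaging squared differences, estimate the covariance function empirically, and compare.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2 ** 16
k = np.fft.fftfreq(n) * 2 * np.pi
P = 1.0 / (1.0 + k ** 2)                 # illustrative power spectrum
z = np.fft.ifft(np.fft.fft(rng.normal(size=n)) * np.sqrt(P)).real

lag = 7
s2 = np.mean((z[lag:] - z[:-lag]) ** 2)  # second-order structure function
zc = z - z.mean()
C0 = np.mean(zc ** 2)                    # C(0): the field's variance
Cr = np.mean(zc[lag:] * zc[:-lag])       # C(r) at r = 7 pixels

print(s2, 2 * (C0 - Cr))                 # Equation 41: the two agree
```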

[Figure 5 schematic: the Covariance Function (Equation 14), the Semivariogram (Equation 27), and the Structure Function (Equation 38 for p = 2) are related by simple algebraic identities, and the Covariance Function is related to the Power Spectrum (Equation 33) by the Fourier Transform.]
Figure 5. This is the Rosetta Stone that we have been searching long and hard for. We summarise how all
methods described in this Note are directly related to the covariance function, which shows how they can all
be translated into each other. Note that all of these formulae assume a second-order stationary, real valued
random field.
5. CONNECTING THE ISLANDS
All three of the methods that we have explored (power spectra, semivariograms, and second-order
structure functions) give the same information. All of them can be transformed into the covariance
function, which is the same as the two-point correlation function for a zero-mean field. Dividing by
the variance, we get the correlation function, which is sometimes (but not always) the same as the
autocorrelation function, depending on who you ask. We show how all of these different functions
are connected in Figure 5.
A natural question to ask after having read this note is, “When should I use each of these methods?”
Unless you’re applying a new statistical technique, you will probably end up applying whichever one is already used in your subfield. Nonetheless, we provide a brief overview of when to use the power spectrum versus one of the other methods described in this Note that do not require the Fourier transform.
If one wishes to avoid Fourier space, the semivariogram and the structure function present themselves as ideal alternatives to the power spectrum. A benefit of these methods is that both deal very well with random fields that have many missing data points, which are commonly encountered in astronomical

problems. These methods also benefit from their explainability, as they require less mathematical
underpinning than the Fourier methods.
Conversely, the methods that employ the Fourier transform (the energy spectrum or power spec-
trum) benefit from speed and computational efficiency. For a semivariogram (or second-order struc-
ture function) to be computed, a distance matrix must be constructed between all pairs of data points,
which can use up a lot of memory, making calculations slow or even impossible without access to
supercomputing resources.
Progress has been made in both of these fields to overcome their respective weaknesses. Much
literature has been published by astronomers on strategies to make the power-spectrum method
work when data points are missing (e.g. Stutzki et al. 1998; Bensch et al. 2001; Ossenkopf et al.
2008; Ar´evalo et al. 2012; Benoit-L´evy et al. 2013; Raghunathan et al. 2019). At the same time, an
algorithm has been published to allow for fast computation of semivariograms using the fast Fourier
transform (Marcotte 1996).
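To give a flavour of how this works, the sketch below computes a 1D semivariogram via the FFT for gap-free, regularly gridded data, using the identity $\gamma(h) = C(0) - C(h)$. This is only in the spirit of Marcotte's method: the published algorithm additionally handles missing values using FFTs of indicator masks, which we omit here.

```python
import numpy as np

def semivariogram_fft(z, max_lag):
    """Semivariogram of a gap-free, regularly gridded 1D field via the FFT
    (a simplified sketch in the spirit of Marcotte 1996)."""
    n = len(z)
    zc = z - z.mean()
    # zero-pad to 2n so the circular correlation below becomes a linear one
    fz = np.fft.fft(zc, 2 * n)
    acorr = np.fft.ifft(fz * np.conj(fz)).real[: max_lag + 1]
    counts = n - np.arange(max_lag + 1)   # number of pairs at each lag
    cov = acorr / counts                  # empirical covariance function
    return cov[0] - cov[1:]               # gamma(h) = C(0) - C(h)
```

Against the direct pair-wise estimator $\hat{\gamma}(h) = \frac{1}{2}\,{\rm mean}\left[(z_{i+h} - z_i)^2\right]$, this agrees to within edge effects of order h/n, while costing O(n log n) rather than a pass over all pairs at every lag.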
In other words, members of these mathematical communities have developed novel methods for
improving all of these techniques in isolation. We advocate for more communication between members
of these different communities for cross-disciplinary approaches for data analysis to be shared, so that
the lessons learned in one field may benefit researchers in other disciplines.
6. AFTERTHOUGHTS
In this Note, we’ve constructed a Rosetta Stone for quantifying spatial correlation. We journeyed
to three different subfields where spatial statistics are used to describe data, and found that the
techniques developed on these different islands are really not as different as we had thought when we
set off. The approaches of the cosmologist, the geostatistician, and the turbulence researcher appear
quite different at first glance, but their respective tools, the power spectra, semivariograms, and
structure functions, actually capture exactly the same information about how random fields behave.
We show in the infographic displayed in Figure 5 how these are all connected to the fundamentals
that we defined in Section 2, and their natural extensions for spatial data (Section 3).
We began this Note with an observation: Things that are close to each other tend to be similar in
other ways. As we finish this piece, we realise that the very reason it was written is because the two
authors are both PhD students at the same university with desks fifteen paces apart from each other.
We found ourselves facing the same problems of quantifying spatial correlations in astronomical data,
but with very different tools in our hands.
In science, we often find ourselves isolated on our respective islands, unable to communicate with
researchers outside of our own particular subfield. Prior to writing this Note, SB’s conversations
around quantifying spatial correlations were nearly exclusively with cosmologists, but she had a
strong intuition that semivariograms seemed to be very similar to (and maybe even the same as) the
power spectrum that is used to study matter density fields in the early Universe. Similarly, BM was
unable to make sense of the terms used by turbulence researchers, limiting his ability to see how the
data that he was analysing with geostatistics could be connected to astrophysical theory about the
structure of the chaotic, magnetic interstellar media of galaxies.
We leave the reader with some strong encouragement to talk to other scientists outside your own subfield. That could be at colloquia, conferences, university hallways, on the internet, or wherever else
good science is done. Yes, the languages spoken in each subfield contain exotic terminology and
definitions that will disagree with what you are familiar with, but the payoffs are absolutely worth

it. These other islands are not abandoned; they are filled with very smart people who have spent
decades working hard to solve the same problems that you have. Talk to each other!
When we zoom out through discussions with other scientists, we can start to orient ourselves. In
this Note, we aimed to produce a map of the archipelago of domains where spatial statistics is studied.
Instead, we discovered that there are land bridges between all of these different islands, and all of us
are on a collective scientific Pangea. We have taken you on a journey with us, only to show you that
we never really left home. To help you on your own journeys, we leave you with this glossary.
GLOSSARY OF TERMS
angular power spectrum – the angular analog to the power spectrum defined in Section 4.2. In
the derivation of the k-mode power spectrum, functions are imagined to be sums of many waves in
space. Similarly, in the derivation of the angular power spectrum, functions are instead defined in
terms of spherical harmonics. The spherical harmonics represent the fundamental modes of “vibra-
tion” on a sphere. The angular power spectrum is the classic way to quantify spatial correlation in
the cosmic microwave background (CMB) because it is almost entirely Gaussian.
autocorrelation – the cross-correlation of a signal with a copy of itself that has been shifted in
space or time. Unfortunately, “cross-correlation” has many different definitions that are all in use
(Equations 23, 24, and 26), so this term can be a bit ambiguous to use.
autocovariance – the covariance of a signal with a copy of itself that has been shifted in space or
time. It is equivalent to the cross-covariance (Equation 25) of a signal with itself.
Bessel’s correction – the use of n −1 in the denominator in the equation for variance (Equation
5) rather than n, so that the sample variance is an unbiased estimator of the true variance.
bispectrum – similar to a power spectrum, but using the third cumulant of a random field instead
of its variance. Explaining what a “third cumulant” is goes beyond the scope of this work – but just
know that it can be used to show how different a random field is from a Gaussian one.
central limit theorem – see Gaussian.
configuration space – a cosmologist’s term for space before any Fourier transforms are applied.
In other words, configuration space is simply real space. We use this term because if we just called
it “space” people might think we are talking about Fourier space.
convolution – given two random fields Z1(⃗x) and Z2(⃗x), the convolution of Z1 and Z2 is given by $Z_1 * Z_2(\vec{x}) = \sum_{\vec{y}} Z_1(\vec{y})\, Z_2(\vec{x} - \vec{y})$, where the sum is taken over all points $\vec{y}$ for which both $Z_1(\vec{y})$ and $Z_2(\vec{x} - \vec{y})$ are defined. This function sees a lot of use in the field of signal processing. The Fourier
transform of the convolution of two random fields is equivalent to the pointwise product of the two
random fields in Fourier space (F{Z1 ∗Z2} = F{Z1}F{Z2}). This equation is often used to quickly
calculate convolutions via the fast Fourier transform algorithm.
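The convolution theorem stated above is easy to demonstrate for (circular) discrete convolutions:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=64)
b = rng.normal(size=64)

# circular convolution computed directly from the definition...
direct = np.array([sum(a[m] * b[(s - m) % 64] for m in range(64))
                   for s in range(64)])
# ...and via the convolution theorem with the fast Fourier transform
via_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

print(np.allclose(direct, via_fft))   # True
```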
convolutional neural network (CNN) – a kind of neural network in which repeated convolutions
are performed on the input data. Fortunately, these are well beyond the scope of this work.
correlation – a statistical measure of the way that two variables are related. If two variables are
positively correlated, then if one is measured to be higher than normal, the other one will probably
be higher than normal, too. If two variables are negatively correlated, then if one is measured to
be higher than normal, the other one will probably be lower than it usually is. Variables with a

correlation of zero are uncorrelated.30 In this Note, we use Pearson’s correlation coefficient (Equation
9) as our correlation statistic of choice – anytime we refer to correlation, we refer to this.
correlogram – another term for the autocorrelation function (Schabenberger & Gotway 2005).
covariance – given two random variables X and Y , the covariance is a statistical measurement
that tells you how changing one affects the other (in other words, it tells you how X and Y covary).
A positive covariance means that when X is higher than its mean, Y will also be higher than its
mean (and when X is lower Y will be lower). A negative covariance means that when X is higher,
Y will be lower (and when X is lower, Y will be higher). A mathematical definition for covariance
is given in Equation 7.
cross-correlation – unfortunately, many different mathematical functions all share the name of
cross-correlation (Equations 23, 24, and 26). For this reason, when reading about cross-correlations
or autocorrelations, it is important to pay attention to which definition the author has chosen to use,
because it may not match the definition that you are familiar with. That being said, the intuitive
idea behind cross-correlations is the same for all definitions: Take two random fields Z1(⃗x) and Z2(⃗x),
shift the second one by a lag ⃗r, multiply them together, and take some kind of an average. This tells
you how similar values of Z1(⃗x) are to values of Z2(⃗x + ⃗r), and is very useful for signal processing.
For example, cross-correlation of voltages in time underpins radio interferometry, which allows us the most precise localisations of objects in the Universe using radio telescopes separated by long distances.
cross-covariance – the covariance between one random field Z1(⃗x) at one position ⃗x, and a second
random field Z2(⃗x+⃗r) at a different location ⃗x+⃗r. A definition for this function is given in Equation
25.
cumulative distribution function (CDF) – the CDF contains the same information as the probability density function described in Section 2, except it tells you what the probability is that X will take a value that is equal to or less than x: ${\rm CDF}(x) = \Pr(X \le x)$. We can translate from a probability distribution to a cumulative distribution with integration: ${\rm CDF}(x) = \int_{-\infty}^{x} p(x')\, dx'$.
cross-power spectra – the Fourier transform of the cross-covariance between two different fields or time series. For example, $P_{xy}(k) \equiv \int_{-\infty}^{\infty} dr\, e^{-ikr}\, \langle T_x(r')\, T_y(r' + r)\rangle$ is the cross-power spectrum between the field $T_x$ and the field $T_y$, assuming translational invariance (to the second order) and isotropy hold in both fields.
distribution – in our context, it’s a shorter way of saying probability distribution.
energy spectrum – a synonym for power spectrum commonly used in the study of turbulence.
expected value – the value that a random variable, or a function of a random variable, is expected
to take. The expected value of a random variable is its mean.
estimator – a function or rule that maps a collection of samples of a random variable to an
estimate of a statistic or parameter. We almost always need to use estimators for statistics rather
than the statistics themselves as we are not usually able to calculate these quantities for an entire
population exactly, as that would require knowing the population exactly. However, we can always
use samples of our random variable to calculate sample estimates. Common estimators include the
sample mean and sample variance which estimate the expectation value and variance of a random
variable, respectively (Equations 1 and 5). In this Note, we denote an estimator with a hat – for
example, d
Var(X) is an estimator of Var(X).
30 Note that independence is a stronger condition than being uncorrelated. Independence means there is absolutely no
dependence between two random variables. Being uncorrelated only means that there is not any linear dependence.

Fourier transform – an integral transform which allows us to express our data in terms of fre-
quencies (or scales for our random fields) by decomposing a signal into a sum of waves. See Section
4.2 for further details.
Gaussian - shorthand for the Gaussian distribution31, or normal distribution32.
If a random
variable is known to have a Gaussian distribution, then you can describe its entire probability dis-
tribution (and therefore know everything about it) simply by knowing its mean and variance. For a
random variable X with a mean of µ and a variance of $\sigma^2$, the probability of X taking any value x is $P(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$. This definition can be extended to the multivariate case. Consider a collection of random variables $X_1, X_2, \ldots, X_n$ (which we can stack into a one-dimensional column matrix called X). If we know that each of these distributions has a mean of $\mu_1, \mu_2, \ldots, \mu_n$ (which we stack into a matrix called µ), and we have the covariance matrix between all of these random variables (we call this Σ, where $\Sigma_{ij} = {\rm Cov}(X_i, X_j)$), then we again have enough information to completely know the probability distribution of (and therefore, everything we could possibly want to know about) this collection of random variables – it is $P(\mathbf{X} = \mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\rm T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$. Here, |Σ| is the determinant of the covariance matrix, which roughly speaking tells you about the overall size of the variance in your data in the same way that $\sigma^2$ does for the one-dimensional case. The
normal distribution is important for statisticians because it comes up all the time – literally. There’s
a theorem in mathematics (the central limit theorem33) that states that if you take enough inde-
pendent observations of any random variable and average them, then after infinite observations, the
distribution you get will always be exactly the normal distribution, irrespective of the distribution
that you started with. Of course, taking infinite samples of a random variable is infinitely expensive
and takes infinite time, so we often have to make models of our random variables that take into
account non-Gaussianities – that is, differences between the actual distributions that we see and the
simple, idealised normal distributions.
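As an illustrative sketch (not part of the Note itself), the two densities above can be evaluated with plain NumPy; the means, variances, and test values below are arbitrary, and the 1D formula is recovered as the n = 1 special case of the multivariate one.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def multivariate_normal_pdf(x, mu, Sigma):
    """Multivariate Gaussian density with mean vector mu and covariance matrix Sigma."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    # Solve Sigma @ y = diff rather than inverting Sigma explicitly.
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# The 1D formula is the n = 1 special case of the multivariate one:
p1 = normal_pdf(0.3, mu=0.0, sigma2=2.0)
p2 = multivariate_normal_pdf(np.array([0.3]), np.array([0.0]), np.array([[2.0]]))
```

For serious work, `scipy.stats` provides vetted implementations; the point here is only that the two formulas agree term by term.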
homogeneous – a random field is homogeneous if it is the same at all points. Exactly what needs
to be the same at each point is not specified by this term. Sometimes it is used to mean that the
exact values of the random field must be the same everywhere. Other times it is used to mean
that the second-order or higher-order statistics of a random field must be the same everywhere – see
stationary.
isotropic – a random field is isotropic if it looks about the same in every direction (i.e. there is
no preferred direction). In other words, it looks statistically the same even if it is rotated (i.e. it is
rotationally invariant).
k-mode – the way cosmologists define their Fourier space modes, $k = \frac{2\pi}{r}$, where r is a distance. In
the temporal domain, we often describe our modes as frequencies, and k-modes are spatial analogs
that can be thought of as spatial frequencies. Here we show a k-mode in 1D, but we can easily
extend this to 3D with $\vec{k} = (k_x, k_y, k_z) = \left(\frac{2\pi}{r_x}, \frac{2\pi}{r_y}, \frac{2\pi}{r_z}\right)$.
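As a small sketch (the grid size and box length below are arbitrary), the k-modes sampled by a discrete Fourier transform can be read off with NumPy; note that `np.fft.fftfreq` returns cycles per unit length, so a factor of 2π converts to the angular convention used here.

```python
import numpy as np

# For a box of side L sampled at N points, the FFT samples k_n = 2*pi*n / L.
N, L = 8, 100.0                          # grid points and box size (toy values)
dx = L / N                               # grid spacing
k = 2 * np.pi * np.fft.fftfreq(N, d=dx)  # angular spatial frequencies (k-modes)

# The fundamental (smallest nonzero) mode corresponds to the largest scale,
# r = L, i.e. k = 2*pi / r with r = L:
k_fund = k[1]
```

In 3D one simply builds one such axis per dimension and forms the vector $\vec{k} = (k_x, k_y, k_z)$ on the grid.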
l-mode and m-modes – the angular analogs to the k-mode that are indices of the spherical
harmonics that allow us to expand functions on a sphere.
31 Named after Carl Friedrich Gauss.
32 Named after Karl Fredrick Normal.
33 A very nifty and well-explained proof of the central limit theorem can be found in this note.

lag – this is a word that geostatisticians and signal processing people sometimes use, but it just
means “separation”. In time series data, temporal lag refers to the separation in time between two
signals: given two values Z(t1) and Z(t2), the lag between them is t2 − t1. This separation is often
given the symbol τ. In spatial data, spatial lag refers to the separation in space between two signals:
given two values Z(⃗x1) and Z(⃗x2), the lag between them is ⃗x2 − ⃗x1. In this Note, we use the
symbol ⃗r to refer to this separation.
mean – a measure of the centre of a random variable. It is our best guess for the approximate value
that this random variable should take. For this reason, it is also called the expected value (Equation
1).
median – an alternative measure of the centre of a random variable. It is the most central value
of the random variable that we measure – if we measure a random variable 101 times, 50 of our
samples will be below the median, 50 will be above the median, and one (the middle one) will be the
median.
mode – another alternative measure of the centre of a random variable. It is the most likely value
for our random variable to be, or the value that appears most often in our sample. For a Gaussian
random variable, the mean, median, and mode will all be equivalent – but this should not be expected
to be true for other kinds of distributions.
moment – the n-th moment of a random variable X is the expected value of Xn. The first moment
of a random variable is its mean. The second moment of a random variable is related to its variance.
Higher order moments tell you about other properties of the distribution of a random variable that
we do not cover in this Note.
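A quick numerical sketch of the first two moments (the sample here is synthetic, with an arbitrary mean of 2 and standard deviation of 3): the first moment is the mean, and the second moment relates to the variance through Var(X) = E[X²] − E[X]².

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=3.0, size=200_000)  # toy sample

m1 = np.mean(x)       # first moment: the mean
m2 = np.mean(x ** 2)  # second moment
# The variance is the second moment minus the square of the first:
var_from_moments = m2 - m1 ** 2
```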
periodogram – a common way of estimating the power spectrum of time series data. As described
in VanderPlas (2018), the power spectrum and the periodogram are equivalent up to a normalisation:
the periodogram is commonly divided by the number of frequencies. It is particularly well suited to
estimating the power spectrum of time series data that have been unevenly sampled. If you want to
learn more about the periodogram, VanderPlas (2018) provides an intuitive introduction to this topic
and, more generally, to quantifying correlations in time series data.
p-value – a kind of calculation in which we assume something which we think is wrong (the null
hypothesis), and then compute the probability of that wrong thing being right. If we measure a low
probability (a low p-value) of the thing we think is wrong being correct, then we can conclude, with
some confidence, that the thing we thought was wrong might actually be wrong (we can reject the
null hypothesis). The problem is, such an analysis won’t always tell you what the right thing to
believe is instead.
power spectral density (PSD) – see power spectrum. This term seems to be more commonly
used in more engineering related fields, and is generally used in reference to time-series data. While
cosmologists talk about power spectra, the engineers that build the telescopes that cosmologists
use to capture the power spectra will call this same statistical measure the power spectral density.
Astronomers who study pulsar timing arrays also prefer the term power spectral density.
power spectrum – the Fourier transform of the covariance function (Equation 33). Many cosmol-
ogy textbooks will define the power spectrum as the Fourier transform of the two-point correlation
function instead – but beware! The two-point correlation function has multiple definitions, not all of
which are equivalent, and the standard definition of the two-point correlation function (Equation 18)

can only be used to define the power spectrum if the random field under investigation has a mean of
zero everywhere.
probability distribution – a function that completely describes a random variable. For every
value that the random variable can possibly take, the probability distribution tells you how likely
each possible value is to occur.
random variable – a mathematical structure that is used to describe something whose value is
not certain. Different measurements of a random variable may result in different values. A random
variable can be fully described by its probability distribution.
random field – a mathematical structure that is used to describe a system that is stochastic in
space or time. In this structure, every point in the domain can be described by a random variable. In
general, points that are closer to each other tend to be more tightly correlated in their values – this is
Tobler’s First Law of Geography. In this Note, we provide explanations of many different techniques
that are used to analyse the correlated spatial structure of random fields.
second-order stationary – a random field is second-order stationary if it follows Condition 13 –
that is, if it has a constant mean throughout, and the covariance between the values of the random
field at any two points ⃗x and ⃗y depends only on the separation between ⃗x and ⃗y. If a field satisfies
this second condition, then it will also necessarily have a constant variance throughout (to use a
statistics term, it will be homoskedastic). If a field is second-order stationary, then we can define its
second-order statistics (its covariance and correlation) as functions of one variable: the separation
between data points.
semivariogram – a statistical tool from the realm of geostatistics that is useful for visualising how
the variance between data points increases with their separation. We give a mathematical definition
in Equation 27. For second-order stationary random fields (where the covariance between data points
depends only on their separation), the semivariogram is closely related to the covariance function, as
shown in Equation 30.
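As an illustrative sketch of the standard empirical estimator (for a regularly sampled 1D field; the helper name and toy data are ours, not from the Note): the semivariogram at lag h is half the mean squared difference between points separated by h. For white noise, which has no spatial correlation, the semivariogram should sit at the field's variance (its "sill") from the very first lag.

```python
import numpy as np

def empirical_semivariogram(z, max_lag):
    """Empirical semivariogram of a regularly sampled 1D field:
    gamma(h) = 0.5 * mean( (z[i+h] - z[i])**2 ) for integer lags h."""
    lags = np.arange(1, max_lag + 1)
    gamma = np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])
    return lags, gamma

# White noise has no spatial correlation, so gamma(h) is approximately the
# variance (here 1) at every lag:
rng = np.random.default_rng(1)
z = rng.normal(size=5000)
lags, gamma = empirical_semivariogram(z, max_lag=10)
```

A spatially correlated field would instead show gamma rising from small values at short lags toward the sill at large lags.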
serial variation function – an old historic word for the semivariogram. The plot of this function
was also called the serial variation curve (Jowett 1955a,b).
signal to noise ratio (SNR) – in astronomy and beyond, a very useful measure for how well you
can see something is the signal to noise ratio, often abbreviated to SNR, or S/N. This is the intensity
(it could be brightness, or loudness, or spectral flux density) of a signal, divided by the standard
deviation (or the noise level, σ) of that signal. Because the standard deviation has the same units as
the signal itself, the SNR will always be dimensionless, regardless of what we are measuring or what
units we use to measure it. You may not understand what magnitudes or decibels or janskies are,
but you can always understand what a SNR of 10 means: the signal is ten times stronger than our
uncertainty about it. This is also called a 10σ detection.
standard deviation – a useful metric that captures how much a random variable tends to vary.
It is equal to the square root of the variance, and is usually denoted by the symbol σ. The units of
the standard deviation will always be the same as the units of the original random variable.
stationary – originally used with time series data, this word is a synonym for homogeneous. Just
like the word homogeneous, the term stationary is used to refer to two different conditions – see the
entries for strictly stationary and second-order stationary.
strictly stationary – a random field is strictly stationary if all of its statistical properties are
translationally invariant. This is stronger than the assumption of second-order stationarity because

it also assumes that all higher-order statistics of the data field do not depend on the absolute position
of data points, and only on their separation relative to each other.
strongly stationary – a synonym for strictly stationary.
translational invariance – see homogeneous.
trispectrum – the Fourier transform of the fourth cumulant of a random field. Like the bispectrum,
it can also be used to search for and characterise non-Gaussianities.
Also, like the bispectrum,
describing it properly in a way that would do it justice is beyond the scope of this Note – see
Sefusatti & Scoccimarro (2005) which describes a trispectrum estimator.
two-point correlation function – a way to quantify the clustering of values in a random field as
a function of separation, commonly applied in cosmology. One common definition is that it is the
function that describes how likely it is to find pairs of points at a given distance compared to a random
distribution (Equation 22).
The two-point correlation function can be viewed as the continuous
analogue of the covariance matrix. However, as described in the Boxes on pages 12–14, there are
multiple definitions of the correlation function used in astrophysics, many of which overlap with what
is sometimes referred to as the autocorrelation function. These definitions do not always align with
the variance normalized version preferred by statisticians presented in Equation 15. Confusingly, it
is not a correlation coefficient – it has units equal to the units of the random field squared, and it is
not normalised to lie between −1 and 1.
variance – a statistical measure of how much a random variable tends to vary. It is computed by
summing the squared difference between every data point and the mean, and dividing by the number
of data points minus one (Equation 5). The units of variance will be equal to the square of the units
of the random variable. The covariance between a variable and itself is its variance.
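A small sketch of the computation described above (the data values are arbitrary). One practical caveat worth a comment: NumPy's `np.var` divides by n rather than n − 1 by default, so the sample variance as defined here requires `ddof=1`.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy sample

# Sample variance as described: squared deviations from the mean, divided
# by (n - 1). Note np.var divides by n by default, hence ddof=1 below.
n = data.size
manual = np.sum((data - data.mean()) ** 2) / (n - 1)
sample_var = np.var(data, ddof=1)

# The covariance of a variable with itself is its variance
# (np.cov uses the n - 1 normalisation by default):
self_cov = np.cov(data, data)[0, 1]
```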
variogram – is a tricky word. In most instances, it is a synonym for semivariogram – this is how
Georges Matheron originally used it, and how it is used in some textbooks (e.g. Chiles & Delfiner
1999). Some other authors define it differently, saying that the variogram is equal to twice the semi-
variogram. For example, Schabenberger & Gotway (2005) call out Chiles & Delfiner (1999) by name
for their definition of a variogram, stating that “there is nothing ‘established’ about being off the
mark by factor 2”, and that the semivariogram as we defined it shouldn’t be called a variogram be-
cause “the savings in ink are disproportionate to the confusion created when ‘semi’ is dropped”. We
recommend simply avoiding the use of the term variogram entirely – since it usually (but not always)
means the same thing as semivariogram, it’s much less confusing to only talk about semivariograms.
wavenumber – a synonym for k-mode, either a vector or its magnitude.
weakly stationary – a synonym for second-order stationary.
weak-sense stationary – a second synonym for second-order stationary.
Wiener-Khinchin Theorem (also Wiener-Khintchine Theorem) – this allows us to relate
the two-point correlation function and the power spectrum through the Fourier transform. It shows
that the two contain exactly the same information. Although we do not discuss it in depth in this
note, it’s essential in allowing the spectral decomposition of the two-point correlation function. Find
its proof here or in Thorne & Blandford (2017).
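For the discrete, periodic case, the theorem can be verified numerically in a few lines (a sketch with a toy zero-mean field; the circular autocorrelation convention below is ours): the Fourier transform of the autocorrelation function equals the squared magnitude of the field's Fourier transform, i.e. the power spectrum.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=256)
x -= x.mean()  # work with a zero-mean field

# Circular autocorrelation as a function of lag r, computed directly:
N = x.size
acf = np.array([np.sum(x * np.roll(x, -r)) for r in range(N)])

# Wiener-Khinchin: the Fourier transform of the autocorrelation equals the
# squared magnitude of the field's Fourier transform (the power spectrum).
lhs = np.fft.fft(acf).real
rhs = np.abs(np.fft.fft(x)) ** 2
```

The two arrays agree to floating-point precision, showing that the autocorrelation function and the power spectrum carry exactly the same information.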
wide-sense stationary – a third synonym for second-order stationary.
window function – a window function is a weighting function that describes how the values of
a random field change due to forces other than spatial clustering, such as foregrounds, instrumental
effects, and survey geometry (Karim et al. 2023). Gorce et al. (2023) shows how window functions

affect cosmological measurements. Estimating these functions is crucial in real-world measurements
of the power spectrum.
ACKNOWLEDGEMENTS
B.M. and S.B. contributed to all roles in writing this Note. We also outline the roles that each of the
authors led in the creation of this Note according to the Contributor Roles Taxonomy (CRediT)34.
B.M. led Writing – original draft, Data curation, and Methodology. S.B. led Software and Resources,
and both authors contributed equally to Visualization.
We thank all of our friends and colleagues for their fruitful conversations about this Note, including
Jose Fuentes Baeza, Aman Chokshi, Justin Clancy, Nicolò Dalmasso, Dillon Dong, Adélie Gorce,
Bradley Greig, Lisa McBride, Shona McEvoy, Kevin Levy, Robert Pascua, Christian Reichardt,
Michele Trenti, and Stuart Wyithe. We also thank Tingjin Chu, Alex Clark, Veronica Dike, Nicholas
Rui, Amit Seta, and Neco Kriel for providing excellent written feedback to specific sections of this
Note. We would further like to thank the anonymous referee, whose careful reading of this manuscript
improved the quality of this work.
A special thank you goes to Tree Smith for giving the Masters’ talk that kickstarted the two-hour
argument between us which inspired us to write this Note.
BM thanks Alex Clark for math inspiration.
BM and SB are also incredibly thankful to Tong Cheunchitra for providing thorough comments on
this Note that gave us new insights we hadn’t thought of.
SB thanks Adrian Liu for a set of notes on the power spectra that were useful in its definition and
comparison to other methods.
BM acknowledges support from Australian Government Research Training Program (RTP) Schol-
arships and The David Lachlan Hay Memorial Fund. SB is supported by the Melbourne Research
Scholarship and N D Goldsworthy Scholarship for Physics. This research is supported in part by
the Australian Research Council Centre of Excellence for All Sky Astrophysics in 3 Dimensions (AS-
TRO 3D), through project number CE170100013. The majority of this research was conducted on
Wurundjeri, Ngunnawal (Ngunawal), and Ngambri land. Sovereignty was never ceded.
REFERENCES
Abell, G. O. 1958, ApJS, 3, 211,
doi: 10.1086/190036
Agterberg, F. P. 2004, Earth Sciences History, 23,
325. http://www.jstor.org/stable/24137099
Arévalo, P., Churazov, E., Zhuravleva, I.,
Hernández-Monteagudo, C., & Revnivtsev, M.
2012, MNRAS, 426, 1793,
doi: 10.1111/j.1365-2966.2012.21789.x
Benoit-Lévy, A., Déchelette, T., Benabed, K.,
et al. 2013, A&A, 555, A37,
doi: 10.1051/0004-6361/201321048
34 https://credit.niso.org/
Bensch, F., Stutzki, J., & Ossenkopf, V. 2001,
A&A, 366, 636,
doi: 10.1051/0004-6361:20000292
Blackman, R. B., & Tukey, J. W. 1958, The Bell
System Technical Journal, 37, 185,
doi: 10.1002/j.1538-7305.1958.tb03874.x
Bok, B. J. 1934, Nature, 133, 578,
doi: 10.1038/133578a0
Bravais, A. 1844, Mémoires présentés par divers
savants à l’Académie Royale des Sciences de
l’Institut de France, 255–332
Champeney, D. C. 1987, A Handbook of Fourier
Theorems (Cambridge University Press)

Cheng, S., Ting, Y.-S., Ménard, B., & Bruna, J.
2020, MNRAS, 499, 5902,
doi: 10.1093/mnras/staa3165
Chiles, J.-P., & Delfiner, P. 1999, Geostatistics:
modeling spatial uncertainty (J. Wiley)
Clark, C. J. R., De Vis, P., Baes, M., et al. 2019,
MNRAS, 489, 5256, doi: 10.1093/mnras/stz2257
Cox, B., & Forshaw, J. R. 2012, The Quantum
Universe: (and Why Anything That Can
Happen, Does) (Da Capo Press.)
Einstein, A. 1915, Sitzungsberichte der
Königlich Preussischen Akademie der
Wissenschaften, 844
González-Gaitán, S., de Souza, R. S.,
Krone-Martins, A., et al. 2019, MNRAS, 482,
3880, doi: 10.1093/mnras/sty2881
Gorce, A., Ganjam, S., Liu, A., et al. 2023,
MNRAS, 520, 375,
doi: 10.1093/mnras/stad090
Hubble, E. 1929, Proceedings of the National
Academy of Science, 15, 168,
doi: 10.1073/pnas.15.3.168
—. 1934, ApJ, 79, 8, doi: 10.1086/143517
Jowett, G. H. 1955a, Journal of the Royal
Statistical Society: Series B (Methodological),
17, 208, doi: 10.1111/j.2517-6161.1955.tb00195.x
—. 1955b, Journal of the Royal Statistical Society
Series C: Applied Statistics, 4, 32,
doi: 10.2307/2985842
Karim, T., Rezaie, M., Singh, S., & Eisenstein, D.
2023, MNRAS, 525, 311,
doi: 10.1093/mnras/stad2210
Kolmogorov, A. 1941, Akademiia Nauk SSSR
Doklady, 30, 301
Krumholz, M. R., & Ting, Y.-S. 2018, MNRAS,
475, 2236, doi: 10.1093/mnras/stx3286
Li, Z., Wisnioski, E., Mendel, J. T., et al. 2023,
MNRAS, 518, 286, doi: 10.1093/mnras/stac3028
Liu, A., & Tegmark, M. 2011, Physical Review D,
83, doi: 10.1103/physrevd.83.103006
Marcotte, D. 1996, Computers and Geosciences,
22, 1175, doi: 10.1016/S0098-3004(96)00026-X
Matheron, G. 1963, Economic geology, 58, 1246
Metha, B., Trenti, M., Battisti, A., & Chu, T.
2024, arXiv e-prints, arXiv:2402.08903.
https://arxiv.org/abs/2402.08903
Metha, B., Trenti, M., & Chu, T. 2021, MNRAS,
508, 489, doi: 10.1093/mnras/stab2554
Metha, B., Trenti, M., Chu, T., & Battisti, A.
2022, MNRAS, 514, 4465,
doi: 10.1093/mnras/stac1484
Mowbray, A. G. 1938, PASP, 50, 275,
doi: 10.1086/124961
Obukhov, A. 1941, Izv. Akad. Nauk. SSSR.
Ser.Geogr. i. Geofiz, 5, 453.
https://cir.nii.ac.jp/crid/1570854175144317312
Ossenkopf, V., Krips, M., & Stutzki, J. 2008,
A&A, 485, 917,
doi: 10.1051/0004-6361:20079106
Peebles, P. J. E. 1980, The large-scale structure of
the universe (Princeton University Press)
Penzias, A. A., & Wilson, R. W. 1965, ApJ, 142,
419, doi: 10.1086/148307
Raghunathan, S., Holder, G. P., Bartlett, J. G.,
et al. 2019, JCAP, 2019, 037,
doi: 10.1088/1475-7516/2019/11/037
Richardson, L. F. 1922, Weather Prediction by
Numerical Process (Cambridge University
Press)
Richardson, L. F. 1960, Statistics of deadly
quarrels (Boxwood Press)
Schabenberger, O., & Gotway, C. 2005, Statistical
Methods for Spatial Data Analysis (Chapman
and Hall/CRC)
Sefusatti, E., & Scoccimarro, R. 2005, Physical
Review D, 71, doi: 10.1103/physrevd.71.063001
Seta, A., Federrath, C., Livingston, J. D., &
McClure-Griffiths, N. M. 2023, MNRAS, 518,
919, doi: 10.1093/mnras/stac2972
Stutzki, J., Bensch, F., Heithausen, A., Ossenkopf,
V., & Zielinsky, M. 1998, A&A, 336, 697
Thorne, K. S., & Blandford, R. D. 2017, Modern
Classical Physics: Optics, Fluids, Plasmas,
Elasticity, Relativity, and Statistical Physics
Tobler, W. R. 1970, Economic Geography, 46, 234.
http://www.jstor.org/stable/143141
VanderPlas, J. T. 2018, ApJS, 236, 16,
doi: 10.3847/1538-4365/aab766
Yaglom, A. M. 1990, Boundary-Layer
Meteorology, 53, v, doi: 10.1007/BF00122458
Yu, J. T., & Peebles, P. J. E. 1969, ApJ, 158, 103,
doi: 10.1086/150175
